Lung Cancer Methylation Markers

ABSTRACT

The present invention discloses a method of diagnosing lung cancer by using methylation specific markers from a set, having diagnostic power for lung cancer diagnosis and distinguishing lung cancer types in diverse samples; as well as methods to identify sets of prognostic and diagnostic value.

The present invention relates to cancer diagnostic methods and means therefor.

Neoplasms and cancer are abnormal growths of cells. Cancer cells rapidly reproduce despite restriction of space, nutrients shared by other cells, or signals sent from the body to stop reproduction. Cancer cells are often shaped differently from healthy cells, do not function properly, and can spread into many areas of the body. Abnormal growths of tissue, called tumors, are clusters of cells that are capable of growing and dividing uncontrollably. Tumors can be benign (noncancerous) or malignant (cancerous). Benign tumors tend to grow slowly and do not spread. Malignant tumors can grow rapidly, invade and destroy nearby normal tissues, and spread throughout the body. Malignant cancers can be both locally invasive and metastatic. Locally invasive cancers can invade the tissues surrounding them by sending out “fingers” of cancerous cells into the normal tissue. Metastatic cancers can send cells into other tissues in the body, which may be distant from the original tumor. Cancers are classified according to the kind of fluid or tissue from which they originate, or according to the location in the body where they first developed. All of these parameters can have an influence on the cancer characteristics, development and progression, and subsequently also on cancer treatment. Therefore, reliable methods to classify a cancer state or cancer type, taking diverse parameters into consideration, are desired. Since cancer is predominantly a genetic disease, trying to classify cancers by genetic parameters is one extensively studied route.

Extensive efforts have been undertaken to discover genes relevant for diagnosis, prognosis and management of (cancerous) disease. Mainly RNA-expression studies have been used for screening to identify genetic biomarkers. Over recent years it has been shown that changes in the DNA-methylation pattern of genes could be used as biomarkers for cancer diagnostics. In concordance with the general strategy for identifying RNA-expression based biomarkers, the most convenient and promising approach would start by identifying marker candidates through genome-wide screening of methylation changes.

The most versatile genome-wide approaches up to now use microarray hybridization based techniques. Although studies have been undertaken at the genomic level (and also the single-gene level) for elucidating methylation changes in diseased versus normal tissue, a comprehensive test obtaining a good success rate for identifying biomarkers is not yet available.

Developing biomarkers for disease (especially cancer) screening, diagnosis, and treatment has improved over the last decade through major advances in different technologies, which have made it easier to discover potential biomarkers through high-throughput screens. Comparing the so-called “OMICs” approaches like Genomics, Proteomics, Metabolomics, and derivatives of those, Genomics is best developed and most widely used for biomarker identification. Because of the dynamic nature of RNA expression, the ease of nucleic acid extraction and the detailed knowledge of the human genome, many studies have used RNA expression profiling for elucidation of class differences for distinguishing the “good” from the “bad” situation, like diseased vs. healthy, or clinical differences between groups of diseased patients. Over the years especially microarray-based expression profiling has become a standard tool for research and some approaches are currently under clinical validation for diagnostics. The plasticity over a broad dynamic range of RNA expression levels is an advantage of using RNA and also a prerequisite of successful discrimination of classes, but the low stability of RNA itself is often seen as a drawback. Because the stability of DNA is tremendously higher than the stability of RNA, DNA based markers are more promising markers and are expected to give robust assays for diagnostics. Many clinical markers in oncology are more or less DNA based and are well established, e.g. cytogenetic analyses for diagnosis and classification of different tumor species. However, most of these markers are not accessible using the cheap and efficient molecular-genetic PCR routine tests. This might be due to 1) the structural complexity of changes, 2) the inter-individual differences of these changes at the DNA-sequence level, and 3) the relatively low “quantitative” fold-changes of those “chromosomal” DNA changes.
In comparison, RNA-expression changes range over some orders of magnitude and these changes can be easily measured using genome-wide expression microarrays. These expression arrays cover the entire translated transcriptome with 20000-45000 probes. Elucidation of DNA changes via microarray techniques requires in general more probes, depending on the requested resolution. Even order(s) of magnitude more probes are required than for standard expression profiling to cover the entire 3×10⁹ bp human genome. For obtaining best resolution when screening biomarkers at the structural genomic DNA level, genomic tiling arrays and SNP-arrays are available today. Although costs of these techniques for analysing DNA have decreased over recent years, many samples have to be tested for biomarker screening, and thus these tests are cost intensive.

Another option for obtaining stable DNA-based biomarkers relies on elucidation of the changes in the DNA methylation pattern of (malignant; neoplastic) disease. In the vertebrate genome methylation affects exclusively the cytosine residues of CpG dinucleotides, which are clustered in CpG islands. CpG islands are often found associated with gene-promoter sequences, present in the 5′-untranslated gene regions, and are per default unmethylated. In a very simplified view, an unmethylated CpG island in the associated gene-promoter enables active transcription, but if the island is methylated, gene transcription is blocked. The DNA methylation pattern is tissue- and clone-specific and almost as stable as the DNA itself. It is also known that DNA-methylation is an early event in tumorigenesis, which would be of interest for early and initial diagnosis of disease. In principle, screening for biomarkers suitable for answering clinical questions, including DNA-methylation based approaches, would be most successful when starting with a genome-wide approach.

Shames D et al. (PLOS Medicine 3(12) (2006): 2244-2262) identified multiple genes that are methylated with high penetrance in primary lung, breast, colon and prostate cancers.

Sato N et al. (Cancer Res 63(13) (2003): 3735-3742) identified potential targets with aberrant methylation in pancreatic cancer. These genes were tested using a treatment with a demethylating agent (5-aza-2′-deoxycytidine and/or the histone deacetylase inhibitor trichostatin A), after which certain genes showed increased transcription.

Bibikova M et al. (Genome Res 16(3) (2006): 383-393) analysed lung cancer biopsy samples to identify methylated CpG sites to distinguish lung adenocarcinomas from normal lung tissues.

Yan P S et al. (Clin Cancer Res 6(4) (2000): 1432-1438) analysed CpG island hypermethylation in primary breast tumor.

Cheng Y et al. (Genome Res 16(2) (2006): 282-289) discussed DNA methylation in CpG islands associated with transcriptional silencing of tumor suppressor genes.

Ongenaert M et al. (Nucleic Acids Res 36 (2008) Database issue D842-D846) provided an overview of the methylation database “PubMeth”.

Microarrays for human genome-wide hybridization testing are known, e.g. the Affymetrix Human Genome U133A Array (NCBI Database, Acc. No. GPL96).

A substantial number of differentially methylated genes has been discovered over the years by chance rather than by rationality. Although some of these methylation changes have the potential to be useful markers for differentiation of specifically defined diagnostic questions, they would lack the power for successful delineation of various diagnostic constellations. Thus, the rational approach would start at the genomic screen for distinguishing the “subtypes” and the diagnostically, prognostically and even therapeutically challenging constellations. These rational expectations are the basis for starting genomic (and also other -omics) screenings but do not guarantee that a marker panel is obtained for all clinically relevant constellations which should be distinguished. Nor is it realistic to expect a universal approach (e.g. transcriptomics) to distinguish, for instance, all subtypes in all different malignancies by focusing on a single class of target molecules (e.g. RNA). Rather, all omics approaches together would be necessary and could help to improve diagnostics and finally patient management.

Lung cancer is the third most common malignant neoplasm in the EU following breast and colon cancers. Lung cancer presents the second worst 5-year survival figures following pancreatic cancer. Thus, although it accounts for 14% of all cancer diagnoses, lung cancer is responsible for 22% of cancer deaths, indicating the poor prognosis of this tumour type and the comparative lack of progress in treatment. Therapy is hampered by the tendency for lung cancer to be diagnosed at a late stage, hence the need to develop markers for early detection. Approximately 80% of lung cancer cases are of the non-small cell type (NSCLC), with squamous cell carcinoma and adenocarcinoma being the most frequent subtypes. A goal of the present invention is to provide an alternative and more cost-efficient route to identify suitable markers for lung cancer diagnostics.

Therefore, in a first aspect, the present invention provides a set of nucleic acid primers or hybridization probes being specific for a potentially methylated region of marker genes being suitable to diagnose or predict lung cancer or a lung cancer type, preferably being selected from adenocarcinoma or squamous cell carcinoma, the marker genes comprising WT1, SALL3, TERT, ACTB, CPEB4. Preferably the set further comprises any one of the markers ABCB1, ACTB, AIM1L, APC, AREG, BMP2K, BOLL, C5AR1, C5orf4, CADM1, CDH13, CDX1, CLIC4, COL21A1, CPEB4, CXADR, DLX2, DNAJA4, DPH1, DRD2, EFS, ERBB2, ERCC1, ESR2, F2R, FAM43A, GABRA2, GAD1, GBP2, GDNF, GNA15, GNAS, HECW2, HIC1, HIST1H2AG, HLA-G, HOXA1, HOXA10, HSD17B4, HSPA2, IRAK2, ITGA4, JUB, KCNJ15, KCNQ1, KIF5B, KL, KRT14, KRT17, LAMC2, MAGEB2, MBD2, MSH4, MT1G, MT3, MTHFR, NEUROD1, NHLH2, NKX2-1, ONECUT2, PENK, PITX2, PLAGL1, PTTG1, PYCARD, RASSF1, S100A8, SALL3, SERPINB5, SERPINE1, SERPINI1, SFRP2, SLC25A31, SMAD3, SPARC, SPHK1, SRGN, TERT, THRB, TJP2, TMEFF2, TNFRSF10C, TNFRSF25, TP53, ZDHHC11, ZNF256, ZNF711, F2R, HOXA10, KL, SALL3, SPARC, TNFRSF25, WT1.

In a further aspect, the present invention provides a method of determining a subset of diagnostic markers for potentially methylated genes from the genes of gene marker IDs 1-359 of table 1, suitable for the diagnosis or prognosis of lung cancer or lung cancer type, comprising

a) obtaining data of the methylation status of at least 50 random genes selected from the 359 genes of gene marker IDs 1-359 in at least 1 sample, preferably 2, 3, 4 or at least 5 samples, of a confirmed lung cancer or lung cancer type state and at least one sample of a lung cancer or lung cancer type negative state,
b) correlating the results of the obtained methylation status with the lung cancer or lung cancer type,
c) optionally repeating the obtaining a) and correlating b) steps for a different combination of at least 50 random genes selected from the 359 genes of gene marker IDs 1-359, and
d) selecting as many marker genes which in a classification analysis have a p-value of less than 0.1 in a random-variance t-test, or selecting as many marker genes which in a classification analysis together have a correct lung cancer or lung cancer type prediction of at least 70% in a cross-validation test,

wherein the selected markers form the subset of diagnostic markers.
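Steps a) and d) can be illustrated with a minimal Python sketch. The function names, the random seed and the p-values are hypothetical placeholders: in practice the p-values would come from the measured methylation data and the statistical test of steps a) and b), not from a random number generator.

```python
import random

MASTER_SET_SIZE = 359  # gene marker IDs 1-359 of table 1


def draw_gene_ids(rng, n_genes=50):
    """Step a): draw a random combination of gene marker IDs from the master set."""
    if not 50 <= n_genes <= MASTER_SET_SIZE:
        raise ValueError("n_genes must lie between 50 and 359")
    return sorted(rng.sample(range(1, MASTER_SET_SIZE + 1), n_genes))


def select_by_p_value(p_values, threshold=0.1):
    """Step d): keep markers whose p-value is below the threshold, best first."""
    kept = [(p, gene_id) for gene_id, p in p_values.items() if p < threshold]
    return [gene_id for p, gene_id in sorted(kept)]


rng = random.Random(0)  # fixed seed, for reproducibility of the illustration only
gene_ids = draw_gene_ids(rng)
# Hypothetical p-values standing in for the outcome of steps a) and b).
p_values = {gene_id: rng.random() for gene_id in gene_ids}
subset = select_by_p_value(p_values)
```

Repeating the draw with a different seed corresponds to optional step c), which evaluates a different combination of at least 50 genes.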

The present invention provides a master set of 359 genetic markers which has been surprisingly found to be highly relevant for aberrant methylation in the diagnosis or prognosis of lung cancer. It is possible to determine a multitude of marker subsets from this master set which can be used to diagnose and differentiate between various lung cancer or tumor types, e.g. adenocarcinoma and squamous cell carcinoma.

The inventive 359 marker genes of table 1 (given in example 1 below) are: NHLH2, MTHFR, PRDM2, MLLT11, S100A9 (control), S100A9, S100A8 (control), S100A8, S100A2, LMNA, DUSP23, LAMC2, PTGS2, MARK1, DUSP10, PARP1, PSEN2, CLIC4, RUNX3, AIM1L, SFN, RPA2, TP73, TP73 (p73), POU3F1, MUTYH, UQCRH, FAF1, TACSTD2, TNFRSF25, DIRAS3, MSH4, GBP2, GBP2, LRRC8C, F3, NANOS1, MGMT, EBF3, DCLRE1C, KIF5B, ZNF22, PGBD3, SRGN, GATA3, PTEN, MMS19, SFRP5, PGR, ATM, DRD2, CADM1, TEAD1, OPCML, CALCA, CTSD, MYOD1, IGF2, BDNF, CDKN1C, WT1, HRAS, DDB1, GSTP1, CCND1, EPS8L2, PIWIL4, CHST11, UNG, CCDC62, CDK2AP1, CHFR, GRIN2B, CCND2, VDR, B4GALNT3, NTF3, CYP27B1, GPR92, ERCC5, GJB2, BRCA2, KL, CCNA1, SMAD9, C13orf15, DGKH, DNAJC15, RB1, RCBTB2, PARP2, APEX1, JUB, JUB (control_NM_198086), EFS, BAZ1A, NKX2-1, ESR2, HSPA2, PSEN1, PGF, MLH3, TSHR, THBS1, MYO5C, SMAD6, SMAD3, NOX5, DNAJA4, CRABP1, BCL2A1 (ID NO: 111), BCL2A1 (ID NO: 112), BNC1, ARRDC4, SOCS1, ERCC4, NTHL1, PYCARD, AXIN1, CYLD, MT3, MT1A, MT1G, CDH1, CDH13, DPH1, HIC1, NEUROD2 (control), NEUROD2, ERBB2, KRT19, KRT14, KRT17, JUP, BRCA1, COL1A1, CACNA1G, PRKAR1A, SPHK1, SOX15, TP53 (TP53_CGI23_1kb), TP53 (TP53_both_CGIs_1kb), TP53 (TP53_CGI36_1kb), TP53, NPTX1, SMAD2, DCC, MBD2, ONECUT2, BCL2, SERPINB5, SERPINB2 (control), SERPINB2, TYMS, LAMA1, SALL3, LDLR, STK11, PRDX2, RAD23A, GNA15, ZNF573, SPINT2, XRCC1, ERCC2, ERCC1, C5AR1 (NM_001736), C5AR1, POLD1, ZNF350, ZNF256, C3, XAB2, ZNF559, FHL2, IL1B, IL1B (control), PAX8, DDX18, GAD1, DLX2, ITGA4, NEUROD1, STAT1, TMEFF2, HECW2, BOLL, CASP8, SERPINE2, NCL, CYP1B1, TACSTD1, MSH2, MSH6, MXD1, JAG1, FOXA2, THBD, CTCFL, CTSZ, GATA5, CXADR, APP, TTC3, KCNJ15, RIPK4, TFF1, SEZ6L, TIMP3, BIK, VHL, IRAK2, PPARG, MBD4, RBP1, XPC, ATR, LXN, RARRES1, SERPINI1, CLDN1, FAM43A, IQCG, THRB, RARB, TGFBR2, MLH1, DLEC1, CTNNB1, ZNF502, SLC6A20, GPX1, RASSF1, FHIT, OGG1, PITX2, SLC25A31, FBXW7, SFRP2, CHRNA9, GABRA2, MSX1, IGFBP7, EREG, AREG, ANXA3, BMP2K, APC, HSD17B4 (ID No 249), HSD17B4 (ID No 250), LOX, TERT, NEUROG1, NR3C1, ADRB2, CDX1, SPARC, C5orf4, PTTG1, DUSP1, CPEB4, SCGB3A1, GDNF, ERCC8, F2R, F2RL1, VCAN, ZDHHC11, RHOBTB3, PLAGL1, SASH1, ULBP2, ESR1, RNASET2, DLL1, HIST1H2AG, HLA-G, MSH5, CDKN1A, TDRD6, COL21A1, DSP, SERPINE1 (ID No 283), SERPINE1 (ID No 284), FBXL13, NRCAM, TWIST1, HOXA1, HOXA10, SFRP4, IGFBP3, RPA3, ABCB1, TFPI2, COL1A2, ARPC1B, PILRB, GATA4, MAL2, DLC1, EPPK1, LZTS1, TNFRSF10B, TNFRSF10C, TNFRSF10D, TNFRSF10A, WRN, SFRP1, SNAI2, RDHE2, PENK, RDH10, TGFBR1, ZNF462, KLF4, CDKN2A, CDKN2B, AQP3, TPM2, TJP2 (ID NO 320), TJP2 (ID No 321), PSAT1, DAPK1, SYK, XPA, ARMCX2, RHOXF1, FHL1, MAGEB2, TIMP1, AR, ZNF711, CD24, ABL1, ACTB, APC, CDH1 (Ecad1), CDH1 (Ecad2), FMR1, GNAS, H19, HIC1, IGF2, KCNQ1, GNAS, CDKN2A (P14), CDKN2B (P15), CDKN2A (P16_VL), PITXA, PITXB, PITXC, PITXD, RB1, SFRP2, SNRPN, XIST, IRF4, UNC13B, GSTP1. Table 1 lists some marker genes twice, such as for different loci and control sequences. It should be understood that any methylation specific region which is readily known to the skilled man in the art from prior publications or available databases (e.g. PubMeth at www.pubmeth.org) can be used according to the present invention. Of course, double listed genes only need to be represented once in an inventive marker set (or set of probes or primers therefor), but preferably a second marker, such as a control region, is included (IDs given in the list above relate to the gene ID (or gene loci ID) given in table 1 of the example section).

One advantage making DNA methylation an attractive target for biomarker development is the fact that cell free methylated DNA can be detected in body fluids like serum, sputum, and urine from patients with cancerous neoplastic conditions and disease. For the purpose of biomarker screening, clinical samples have to be available. For obtaining a sufficient number of samples with clinical and “outcome” or survival data, the first step would be using archived (tissue) samples. Preferably these materials should fulfill the requirements to obtain intact RNA and DNA, but most archives of clinical samples store formalin fixed paraffin embedded (FFPE) tissue blocks. This has been the clinico-pathological routine for decades, but such fixed samples are, if at all, only suitable for extraction of low-quality RNA. It has now been found that according to the present invention any such samples (as any comprising tumor DNA) can be used for the method of generating an inventive subset, including fixed samples. The samples can be of lung tissue or any body fluid, e.g. sputum, bronchial lavage, or serum derived from peripheral blood or blood cells. Blood or blood derived samples preferably have reduced, e.g. <95%, or no leukocyte content but comprise DNA of the cancerous cells or tumor. Preferably the inventive markers are of human genes. Preferably the samples are human samples.

The present invention provides a multiplexed methylation testing method which 1) outperforms the “classification” success of genome-wide screenings via RNA-expression profiling, 2) enables identification of biomarkers for a wide variety of diseases without the need to prescreen candidate markers on a genome-wide scale, 3) is suitable for minimal invasive testing, and 4) is easily scalable.

In contrast to the rational strategy for elucidation of biomarkers for differentiation of disease, the invention presents a targeted multiplexed DNA-methylation test which outperforms genome-scaled approaches (including RNA expression profiling) for disease diagnosis, classification, and prognosis.

The inventive set of 359 markers enables selection of a subset of markers from this 359 set which is highly characteristic of lung cancer and a given lung cancer type. Further indicators differentiating between cancer types or generally neoplastic conditions are e.g. benign (non (or limited) proliferative) or malignant, metastatic or non-metastatic tumors or nodules. It is sometimes possible to differentiate the sample type from which the methylated DNA is isolated, e.g. urine, blood, tissue samples.

The present invention is suitable to differentiate diseases, in particular neoplastic conditions, or tumor types. Diseases and neoplastic conditions should be understood in general including benign and malignant conditions. According to the present invention benign nodules (being at least the potential onset of malignancy) are included in the definition of a disease. After the development of a malignancy the condition is a preferred disease to be diagnosed by the markers screened for or used according to the present invention. The present invention is suitable to distinguish benign and malignant tumors (both being considered a disease according to the present invention). In particular the invention can provide markers (and their diagnostic or prognostic use) distinguishing between a normal healthy state together with a benign state on one hand and malignant states on the other hand. A diagnosis of lung cancer may include identifying the difference to a normal healthy state, e.g. the absence of any neoplastic nodules or cancerous cells. The present invention can also be used for prognosis of lung cancer, in particular a prediction of the progression of lung cancer or lung cancer type. A particularly preferred use of the invention is to perform a diagnosis or prognosis of metastasising lung cancer (distinguished from non-metastasising conditions).

In the context of the present invention “prognosis”, “prediction” or “predicting” should not be understood in an absolute sense, as in a certainty that an individual will develop lung cancer or lung cancer type (including cancer progression), but as an increased risk to develop cancer or the lung cancer type or of cancer progression. “Prognosis” is also used in the context of predicting disease progression, in particular to predict therapeutic results of a certain therapy of the disease, in particular neoplastic conditions, or lung cancer types. The prognosis of a therapy can e.g. be used to predict a chance of success (i.e. curing a disease) or chance of reducing the severity of the disease to a certain level. As a general inventive concept, markers screened for this purpose are preferably derived from sample data of patients treated according to the therapy to be predicted. The inventive marker sets may also be used to monitor a patient for the emergence of therapeutic results or positive disease progressions.

Some of the inventive, rationally selected markers have been found methylated in some instances. DNA methylation analyses in principle rely either on bisulfite deamination-based methylation detection or on using methylation sensitive restriction enzymes. Preferably the restriction enzyme-based strategy is used for elucidation of DNA-methylation changes. Further methods to determine methylated DNA are e.g. given in EP 1 369 493 A1 or U.S. Pat. No. 6,605,432. Combining restriction digestion and multiplex PCR amplification with a targeted microarray-hybridization is a particularly advantageous strategy to perform the inventive methylation test using the inventive marker sets (or subsets). A microarray-hybridization step can be used for reading out the PCR results. For the analysis of the hybridization data, statistical approaches for class comparisons and class prediction can be used. Such statistical methods are known from analysis of RNA-expression derived microarray data.

If only limiting amounts of DNA are available for analyses, an amplification protocol can be used enabling selective amplification of the methylated DNA fraction prior to methylation testing. Subjecting these amplicons to the methylation test, it was possible to successfully distinguish DNA from sensitive cases from normal healthy controls. In addition it was possible to distinguish lung-cancer patients from healthy normal controls using DNA from serum by the inventive methylation test upon preamplification. Both examples clearly illustrate that the inventive multiplexed methylation testing can be successfully applied when only limiting amounts of DNA are available. Thus, this principle might be the preferred method for minimal invasive diagnostic testing.

In most situations several genes are necessary for classification. Although the 359 marker set test is not a genome-wide test and might be used as it is for diagnostic testing, running a subset of markers (comprising the classifier which enables best classification) would be easier for routine applications. The test is easily scalable. Thus, to test only the subset of markers comprising the classifier, the selected subset of primers/probes could be applied directly to set up the lower multiplexed test (or single PCR-test). Serum DNA can be used to classify or distinguish healthy patients from individuals with lung tumors. Only the specific primers comprising the gene-classifier obtained from the methylation test may be set up together in multiplexed PCR reactions.

In summary the inventive methylation test is a suitable tool for differentiation and classification of neoplastic disease. This assay can be used for diagnostic purposes and for defining biomarkers for clinically relevant issues to improve diagnosis of disease, and to classify patients at risk for disease progression, thereby improving disease treatment and patient management.

The first step of the inventive method of generating a subset, step a) of obtaining data of the methylation status, preferably comprises determining data of the methylation status, preferably by methylation specific PCR analysis or methylation specific digestion analysis. Methylation specific digestion analysis can include either or both of hybridization of suitable probes for detection to non-digested fragments or PCR amplification and detection of non-digested fragments.

The inventive selection can be made by any (known) classification method to obtain a set of markers with the given diagnostic (or also prognostic) value to categorize a lung cancer or lung cancer type. Such methods include class comparisons wherein a specific p-value is selected, e.g. a p-value below 0.1, preferably below 0.08, more preferred below 0.06, in particular preferred below 0.05, below 0.04, below 0.02, most preferred below 0.01.

Preferably the correlated results for each gene b) are rated by their correct correlation to lung cancer or lung cancer type positive state, preferably by p-value test or t-value test or F-test. Rated (best first, i.e. low p- or t-value) markers are then subsequently selected and added to the subset until a certain diagnostic value is reached, e.g. the herein mentioned at least 70% (or more) correct classification of lung cancer or lung cancer type.
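The best-first addition of rated markers can be sketched as follows. This is a minimal illustration only: the p-values and the accuracy figures are hypothetical, and `accuracy_of` stands for whatever classification test (e.g. a cross-validation) is actually used to evaluate a candidate subset.

```python
def build_subset(ranked_genes, accuracy_of, target=0.70):
    """Add markers best-first until the subset reaches the target accuracy."""
    subset = []
    for gene in ranked_genes:
        subset.append(gene)
        if accuracy_of(subset) >= target:
            break
    return subset


# Hypothetical p-values for five markers; a lower p-value means a better rating.
p_values = {"TERT": 0.02, "WT1": 0.004, "CPEB4": 0.05, "SALL3": 0.01, "ACTB": 0.03}
ranked = sorted(p_values, key=p_values.get)

# Hypothetical accuracy as a function of subset size (placeholder for a real test).
accuracy_by_size = {1: 0.55, 2: 0.64, 3: 0.72, 4: 0.80, 5: 0.81}
subset = build_subset(ranked, lambda genes: accuracy_by_size[len(genes)])
```

With these placeholder figures the loop stops after three markers, because that is the first subset size at which the assumed accuracy reaches 70%.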

Class comparison procedures include identification of genes that were differentially methylated among the two classes using a random-variance t-test. The random-variance t-test is an improvement over the standard separate t-test as it permits sharing information among genes about within-class variation without assuming that all genes have the same variance (Wright G. W. and Simon R, Bioinformatics 19:2448-2455, 2003). Genes were considered statistically significant if their p-value was less than a certain value, e.g. 0.1 or 0.01. A stringent significance threshold can be used to limit the number of false positive findings. A global test can also be performed to determine whether the methylation profiles differed between the classes by permuting the labels of which arrays corresponded to which classes. For each permutation, the p-values can be re-computed and the number of genes significant at the e.g. 0.01 level can be noted. The proportion of the permutations that give at least as many significant genes as with the actual data is then the significance level of the global test. If there are more than 2 classes, then the “F-test” instead of the “t-test” should be used.
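The permutation-based global test can be sketched in Python as follows. Two simplifications are assumed: a standard pooled-variance t statistic replaces the random-variance t-test, and a fixed cutoff on |t| stands in for the p-value threshold, because the Python standard library provides no t-distribution. The data values, gene names and function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Standard pooled-variance two-sample t statistic (not the random-variance test)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if pooled == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def count_significant(data, labels, t_cutoff):
    """Number of genes whose |t| reaches the cutoff for the given class labels."""
    count = 0
    for values in data.values():
        a = [v for v, lab in zip(values, labels) if lab == 1]
        b = [v for v, lab in zip(values, labels) if lab == 0]
        if abs(t_statistic(a, b)) >= t_cutoff:
            count += 1
    return count


def global_test(data, labels, t_cutoff, n_perm=1000, seed=0):
    """Proportion of label permutations with at least as many significant genes."""
    rng = random.Random(seed)
    observed = count_significant(data, labels, t_cutoff)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_significant(data, shuffled, t_cutoff) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical methylation values: first four samples cancer (1), last four control (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "geneA": [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05],
    "geneB": [0.5, 0.4, 0.6, 0.5, 0.5, 0.6, 0.4, 0.5],
    "geneC": [0.7, 0.9, 0.8, 0.6, 0.2, 0.3, 0.1, 0.4],
}
observed, p_global = global_test(data, labels, t_cutoff=3.0)
```

Shuffling the label list keeps the class sizes fixed, so each permutation only reassigns which arrays belong to which class, as described above.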

Class prediction includes the step of specifying a significance level to be used for determining the genes that will be included in the subset. Genes that are differentially methylated between the classes at a univariate parametric significance level less than the specified threshold are included in the set. It does not matter whether the specified significance level is small enough to exclude enough false discoveries. In some problems better prediction can be achieved by being more liberal about the gene sets used as features. The sets may be more biologically interpretable and clinically applicable, however, if fewer genes are included. Similar to cross-validation, gene selection is repeated for each training set created in the cross-validation process. That is for the purpose of providing an unbiased estimate of prediction error. The final model and gene set for use with future data is the one resulting from application of the gene selection and classifier fitting to the full dataset.

Models for utilizing gene methylation profile to predict the class of future samples can also be used. These models may be based on the Compound Covariate Predictor (Radmacher et al. Journal of Computational Biology 9:505-511, 2002), Diagonal Linear Discriminant Analysis (Dudoit et al. Journal of the American Statistical Association 97:77-87, 2002), Nearest Neighbor Classification (also Dudoit et al.), and Support Vector Machines with linear kernel (Ramaswamy et al. PNAS USA 98:15149-54, 2001). The models incorporated genes that were differentially methylated among genes at a given significance level (e.g. 0.01, 0.05 or 0.1) as assessed by the random variance t-test (Wright G. W. and Simon R. Bioinformatics 19:2448-2455, 2003). The prediction error of each model using cross validation, preferably leave-one-out cross-validation (Simon et al. Journal of the National Cancer Institute 95:14-18, 2003), is preferably estimated. For each leave-one-out cross-validation training set, the entire model building process was repeated, including the gene selection process. It may also be evaluated whether the cross-validated error rate estimate for a model was significantly less than one would expect from random prediction. The class labels can be randomly permuted and the entire leave-one-out cross-validation process is then repeated. The significance level is the proportion of the random permutations that gave a cross-validated error rate no greater than the cross-validated error rate obtained with the real methylation data. Usually about 1000 random permutations may be used.
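The point that gene selection is repeated inside every leave-one-out training set can be illustrated with the following Python sketch. It makes several assumptions: a plain nearest-centroid rule replaces the predictors named above, genes are ranked by a standard pooled-variance t statistic rather than the random-variance t-test, and a fixed number of top-ranked genes is kept instead of applying a significance level. All data values and names are hypothetical.

```python
import math
from statistics import mean, variance


def abs_t(a, b):
    """Absolute pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = abs(mean(a) - mean(b))
    if pooled == 0:
        return math.inf if diff > 0 else 0.0
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def select_genes(samples, labels, n_genes):
    """Rank genes on the training data only and keep the top n_genes."""
    def score(gene):
        a = [s[gene] for s, lab in zip(samples, labels) if lab == 1]
        b = [s[gene] for s, lab in zip(samples, labels) if lab == 0]
        return abs_t(a, b)
    return sorted(samples[0].keys(), key=score, reverse=True)[:n_genes]


def nearest_centroid(train, labels, genes, x):
    """Assign x to the class whose centroid (over the chosen genes) is closest."""
    best_label, best_dist = None, math.inf
    for lab in sorted(set(labels)):
        dist = 0.0
        for gene in genes:
            centroid = mean(s[gene] for s, l in zip(train, labels) if l == lab)
            dist += (x[gene] - centroid) ** 2
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


def loocv_accuracy(samples, labels, n_genes=2):
    """Leave-one-out cross-validation with gene selection redone in every fold."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, n_genes)  # held-out sample unseen
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical methylation values: four cancer samples (1), four controls (0).
samples = [
    {"A": 0.90, "B": 0.5, "C": 0.7}, {"A": 0.80, "B": 0.4, "C": 0.9},
    {"A": 0.85, "B": 0.6, "C": 0.8}, {"A": 0.95, "B": 0.5, "C": 0.6},
    {"A": 0.10, "B": 0.5, "C": 0.2}, {"A": 0.20, "B": 0.6, "C": 0.3},
    {"A": 0.15, "B": 0.4, "C": 0.1}, {"A": 0.05, "B": 0.5, "C": 0.4},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
accuracy = loocv_accuracy(samples, labels)
```

Selecting genes on the full dataset before cross-validating would let the held-out sample influence the feature choice and bias the error estimate downward, which is why the selection sits inside the loop.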

Another classification method is the greedy-pairs method described by Bo and Jonassen (Genome Biology 3(4):research0017.1-0017.11, 2002). The greedy-pairs approach starts with ranking all genes based on their individual t-scores on the training set. The procedure selects the best ranked gene g_i and finds the one other gene g_j that together with g_i provides the best discrimination, using as a measure the distance between centroids of the two classes with regard to the two genes when projected to the diagonal linear discriminant axis. These two selected genes are then removed from the gene set and the procedure is repeated on the remaining set until the specified number of genes have been selected. This method attempts to select pairs of genes that work well together to discriminate the classes.
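A simplified Python sketch of the greedy-pairs procedure, following the description above, is given here. It is an illustration under assumptions, not the published implementation: the pair score is taken as the distance between the two class centroids after projection onto a unit-length diagonal linear discriminant axis, a small floor on the variance avoids division by zero, and the data values and names are hypothetical.

```python
import math
from statistics import mean


def class_stats(values, labels):
    """Return (difference of class means, pooled within-class variance)."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled = max(ss / (len(a) + len(b) - 2), 1e-12)  # floor guards zero variance
    return ma - mb, pooled


def t_score(values, labels):
    """Absolute two-sample t-score of a single gene."""
    diff, pooled = class_stats(values, labels)
    n1 = sum(labels)
    n0 = len(labels) - n1
    return abs(diff) / math.sqrt(pooled * (1 / n1 + 1 / n0))


def pair_score(data, labels, g1, g2):
    """Distance between class centroids projected onto the discriminant axis."""
    stats = [class_stats(data[g], labels) for g in (g1, g2)]
    weights = [diff / pooled for diff, pooled in stats]
    norm = math.hypot(*weights)
    if norm == 0:
        return 0.0
    return abs(sum(w * diff for w, (diff, _) in zip(weights, stats))) / norm


def greedy_pairs(data, labels, n_pairs):
    """Pick the best-ranked gene, then its best partner; remove both and repeat."""
    remaining = sorted(data, key=lambda g: t_score(data[g], labels), reverse=True)
    selected = []
    while len(remaining) >= 2 and len(selected) < n_pairs:
        first = remaining.pop(0)
        partner = max(remaining, key=lambda g: pair_score(data, labels, first, g))
        remaining.remove(partner)
        selected.append((first, partner))
    return selected


# Hypothetical methylation values: first four samples cancer (1), last four control (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "A": [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05],
    "B": [0.5, 0.4, 0.6, 0.5, 0.5, 0.6, 0.4, 0.5],
    "C": [0.7, 0.9, 0.8, 0.6, 0.2, 0.3, 0.1, 0.4],
    "D": [0.6, 0.5, 0.7, 0.6, 0.4, 0.5, 0.3, 0.4],
}
pairs = greedy_pairs(data, labels, n_pairs=2)
```

The cited publication should be consulted for the exact pair score; the projected centroid distance used here is only one reading of the description in the text.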

Furthermore, a binary tree classifier for utilizing gene methylation profile can be used to predict the class of future samples. The first node of the tree incorporated a binary classifier that distinguished two subsets of the total set of classes. The individual binary classifiers were based on the “Support Vector Machines” incorporating genes that were differentially methylated among genes at the significance level (e.g. 0.01, 0.05 or 0.1) as assessed by the random variance t-test (Wright G. W. and Simon R. Bioinformatics 19:2448-2455, 2003). Classifiers for all possible binary partitions are evaluated and the partition selected was that for which the cross-validated prediction error was minimum. The process is then repeated successively for the two subsets of classes determined by the previous binary split. The prediction error of the binary tree classifier can be estimated by cross-validating the entire tree building process. This overall cross-validation included re-selection of the optimal partitions at each node and re-selection of the genes used for each cross-validated training set as described by Simon et al. (Simon et al. Journal of the National Cancer Institute 95:14-18, 2003). 10-fold cross validation can be utilized, in which one-tenth of the samples is withheld, a binary tree is developed on the remaining 9/10 of the samples, and then class membership is predicted for the 10% of the samples withheld. This is repeated 10 times, each time withholding a different 10% of the samples. The samples are randomly partitioned into 10 test sets (Simon R and Lam A. BRB-ArrayTools User Guide, version 3.2. Biometric Research Branch, National Cancer Institute).

Preferably the correlated results for each gene b) are rated by their correct correlation to lung cancer or lung cancer type positive state, preferably by p-value test. It is also possible to include a step in that the genes are selected d) in order of their rating.

Independent of the method that is finally used to produce a subset with certain diagnostic or predictive value, the subset selection preferably results in a subset with at least 60%, preferably at least 65%, at least 70%, at least 75%, at least 80% or even at least 85%, at least 90%, at least 92%, at least 95%, in particular preferred 100% correct classification of test samples of lung cancer or lung cancer type. Such levels can be reached by repeating, according to step c), steps a) and b) of the inventive method, if necessary.

To prevent an increase in the number of members of the subset, only marker genes with a significance value of at most 0.1, preferably at most 0.8, even more preferred at most 0.6, at most 0.5, at most 0.4, at most 0.2, or more preferred at most 0.01 are selected.

In particular preferred embodiments the at least 50 genes of step a) are at least 70, preferably at least 90, at least 100, at least 120, at least 140, at least 160, at least 180, at least 190, at least 200, at least 220, at least 240, at least 260, at least 280, at least 300, at least 320, at least 340, at least 350 or all, genes.

Since the subset should be small it is preferred that not more than 60, or not more than 40, preferably not more than 30, in particular preferred not more than 20, marker genes are selected in step d) for the subset.
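The selection criteria described above (rating by p-value, a significance cut-off, and an upper limit on the subset size) can be combined as in the following sketch. The function name, the default values and the example p-values are assumptions chosen for illustration only; they are not data from this disclosure.

```python
def select_markers(p_values, alpha=0.1, max_size=20):
    """Return up to `max_size` genes with p <= alpha, best-rated first.

    `p_values` maps a gene symbol to its p-value (hypothetical input).
    """
    passing = [(p, gene) for gene, p in p_values.items() if p <= alpha]
    passing.sort()  # smallest p-value first; ties broken by gene symbol
    return [gene for p, gene in passing[:max_size]]


# Made-up p-values for four placeholder genes.
example = {"GENE_A": 0.001, "GENE_B": 0.2, "GENE_C": 0.03, "GENE_D": 0.0005}
print(select_markers(example, alpha=0.05, max_size=2))
```

Tightening `alpha` or lowering `max_size` keeps the subset small, which is the stated aim of the two preceding paragraphs.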

In a further aspect the present invention provides a method of identifying lung cancer or lung cancer type in a sample comprising DNA from a patient, comprising providing a diagnostic subset of markers identified according to the method depicted above, determining the methylation status of the genes of the subset in the sample and comparing the methylation status with the status of a confirmed lung cancer or lung cancer type positive and/or negative state, thereby identifying lung cancer or lung cancer type in the sample.
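The text does not tie the comparison step to a particular metric. As one possible illustration, the sketch below assigns a sample to whichever confirmed reference state (positive or negative) has the closer mean methylation profile. The nearest-mean rule, the function names and the numeric values are all assumptions made for this example.

```python
def mean_profile(samples):
    """Average methylation value per gene over a list of reference samples."""
    genes = samples[0].keys()
    return {g: sum(s[g] for s in samples) / len(samples) for g in genes}


def squared_distance(sample, profile):
    """Sum of squared per-gene differences between a sample and a profile."""
    return sum((sample[g] - profile[g]) ** 2 for g in profile)


def classify(sample, positive_refs, negative_refs):
    """Label a sample by the nearer of the two reference mean profiles."""
    pos = mean_profile(positive_refs)
    neg = mean_profile(negative_refs)
    if squared_distance(sample, pos) < squared_distance(sample, neg):
        return "positive"
    return "negative"


# Hypothetical methylation fractions (0..1) for two marker genes.
positive = [{"WT1": 0.8, "SALL3": 0.7}, {"WT1": 0.9, "SALL3": 0.9}]
negative = [{"WT1": 0.1, "SALL3": 0.2}, {"WT1": 0.2, "SALL3": 0.1}]
print(classify({"WT1": 0.85, "SALL3": 0.6}, positive, negative))
```

Any classifier trained on confirmed positive and negative samples could take the place of the nearest-mean rule here.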

The methylation status can be determined by any method known in the art including methylation dependent bisulfite deamination (and consequently the identification of mC, i.e. methylated C, changes by any known methods, including PCR and hybridization techniques). Preferably, the methylation status is determined by methylation specific PCR analysis, methylation specific digestion analysis and either or both of hybridisation analysis to non-digested or digested fragments or PCR amplification analysis of non-digested fragments. The methylation status can also be determined by any probes suitable for determining the methylation status including DNA, RNA, PNA, LNA probes which optionally may further include methylation specific moieties.

As further explained below the methylation status can be particularly determined by using hybridisation probes or amplification primers (preferably PCR primers) specific for methylated regions of the inventive marker genes. Discrimination between methylated and non-methylated genes, including the determination of the methylation amount or ratio, can be performed by using e.g. either one of these tools.

The determination using only specific primers aims at specifically amplifying methylated (or in the alternative non-methylated) DNA. This can be facilitated by using (methylation dependent) bisulfite deamination, methylation specific enzymes or by using methylation specific nucleases to digest methylated (or alternatively non-methylated) regions, so that consequently only the non-methylated (or alternatively methylated) DNA is obtained. By using a genome chip (or simply a gene chip including hybridization probes for all genes of interest such as all 359 marker genes), all amplification or non-digested products are detected. I.e. discrimination between methylated and non-methylated states as well as gene selection (the inventive set or subset) takes place before the step of detection on a chip.

Alternatively it is possible to use universal primers and amplify a multitude of potentially methylated genetic regions (including the genetic markers of the invention) which are, as described, either amplified in a methylation specific manner or digested, and then use a set of hybridisation probes for the characteristic markers on e.g. a chip for detection. I.e. gene selection is performed on the chip.

Either set, a set of probes or a set of primers, can be used to obtain the relevant methylation data of the genes of the present invention. Of course, both sets can be used.

The method according to the present invention may be performed by any method suitable for the detection of methylation of the marker genes. In order to provide a robust and optionally re-useable test format, the determination of the gene methylation is preferably performed with a DNA-chip, real-time PCR, or a combination thereof. The DNA chip can be a commercially available general gene chip (also comprising a number of spots for the detection of genes not related to the present method) or a chip specifically designed for the method according to the present invention (which predominantly comprises marker gene detection spots).

Preferably the methylated DNA of the sample is detected by a multiplexed hybridization reaction. In further embodiments the methylated DNA is preamplified prior to hybridization, preferably also prior to methylation specific amplification or digestion. Preferably, the amplification reaction is also multiplexed (e.g. multiplex PCR).

The inventive methods (for the screening of subsets or for diagnosis or prognosis of lung cancer or lung cancer type) are particularly suitable to detect low amounts of methylated DNA of the inventive marker genes. Preferably the DNA amount in the sample is below 500 ng, below 400 ng, below 300 ng, below 200 ng, below 100 ng, below 50 ng or even below 25 ng. The inventive method is likewise particularly suitable to detect low concentrations of methylated DNA of the inventive marker genes. Preferably the DNA concentration in the sample is below 500 ng, below 400 ng, below 300 ng, below 200 ng, below 100 ng, below 50 ng or even below 25 ng, per ml sample.

In another aspect the present invention provides a subset comprising orconsisting of nucleic acid primers or hybridization probes beingspecific for a potentially methylated region of at least marker genesselected from a set of nucleic acid primers or hybridization probesbeing specific for a potentially methylated region of marker genes beingsuitable to diagnose or predict lung cancer or a lung cancer type,preferably being selected from adenocarcinoma or squamous cellcarcinoma, the marker genes comprising WT1, SALL3, TERT, ACTB, CPEB4 orany other subset selected from one of the following groups

-   a) WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, TNFRSF10C
-   b) WT1, PITX2, SALL3, F2R, DLX2, TERT, HOXA10, MSH4, NHLH2, GNA15, PENK, RASSF1, BOLL, HOXA1, ONECUT2, ABCB1, SPARC, MT1G, HSPA2, SFRP2, PYCARD, GAD1, C5orf4, C5AR1, GDNF, ZDHHC11, SERPINE1, NKX2-1, PITX2, C5AR1, ZNF256, FAM43A, SFRP2, MT3, SERPINE1, CLIC4, TNFRSF10C, GABRA2, MTHFR, ESR2, NEUROG1, PITX2, PLAGL1, TMEFF2, PTTG1, CADM1, S100A8, EFS, JUB, ITGA4, MAGEB2, ERBB2, SRGN, GNAS, TJP2, KCNJ15, SLC25A31, ZNF573, TNFRSF25, APC, KCNQ1, LAMC2, SPHK1, DNAJA4, APC, MBD2, ERCC1, HLA-G, CXADR, TP53, ACTB, KL, SMAD3, HIST1H2AG, CPEB4
-   c) WT1, DLX2, SALL3, TERT, TNFRSF25, ACTB, SMAD3, CPEB4
-   d) WT1, DLX2, SALL3, TERT, PITX2, TNFRSF25, KL, ACTB, SMAD3, CPEB4
-   e) WT1, PITX2, SALL3, DLX2, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DNAJA4, HLA-G, CXADR, TP53, ACTB, CPEB4
-   f) WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, CPEB4
-   g) WT1, ACTB, DLX2, PITX2, SALL3, HOXA10, TERT, CPEB4, HLA-G, SPARC, RASSF1, DNAJA4, CXADR, TP53, IRAK2, ZNF711
-   h) F2R, ZNF256, CDH13, SERPINB5, KRT14, DLX2, AREG, THRB, HSD17B4, SPARC, HECW2, COL21A1
-   i) KL, HIST1H2AG, TJP2, SRGN, CDX1, TNFRSF25, APC, HIC1, APC, GNA15, ACTB, WT1, KRT17, AIM1L, DPH1, PITX2, PITX2, KIF5B, BMP2K, GBP2, NHLH2, GDNF, BOLL
-   j) WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, TNFRSF10C
-   k) HOXA10, NEUROD1
-   l) WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, CPEB4, DLX2, TNFRSF25, KL, SMAD3
-   m) TNFRSF25, SALL3, RASSF1, TERT, SPARC, F2R, HOXA10, ZNF711, PITX2
-   n) SALL3, PITX2, SPARC, F2R, TERT, RASSF1, HOXA10, CXADR, KL
-   o) SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, KL
-   p) SALL3, PITX2, SPARC, F2R, HOXA10, DRD2, ACTB, DNAJA4, CXADR, KL
-   q) SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, TNFRSF25, DNAJA4, TP53, CXADR, KL
-   r) SPARC, SALL3, F2R, PITX2, RASSF1, HOXA10, TERT, KL, TNFRSF25
-   s) SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, KL, TNFRSF25, CXADR
-   t) HOXA10, RASSF1, F2R

or

a set of at least 50%, preferably at least 60%, at least 70%, at least 80%, at least 90% or 100% of the markers of any one of the above a) to t). The present inventive set also includes sets with at least 50% of the above markers for each set, since it is also possible to substitute parts of these subsets that are specific for, in the case of binary conditions/differentiations, e.g. good or bad prognosis or the distinction between lung cancer or lung cancer types, wherein one part of the subset points in one direction for a certain lung cancer type or cancer/differentiation. It is possible to further complement the 50% part of the set by additional markers specific for diagnosing lung cancer or determining the other part of the good or bad differentiation or differentiation between two lung cancer types. Methods to determine such complementing markers follow the general methods as outlined herein.
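Whether a candidate panel reaches the stated share of one of the subsets a) to t) can be checked mechanically, as in the sketch below, which uses subset c) as the reference. The function names are hypothetical, and counting each gene symbol once (several subsets list a symbol such as APC or PITX2 more than once) is an assumption of this example.

```python
SET_C = ["WT1", "DLX2", "SALL3", "TERT", "TNFRSF25", "ACTB", "SMAD3", "CPEB4"]


def coverage(panel, reference):
    """Fraction of distinct reference markers that also occur in `panel`."""
    ref = set(reference)
    return len(ref & set(panel)) / len(ref)


def meets_threshold(panel, reference, minimum=0.5):
    """True if the panel contains at least `minimum` of the reference markers."""
    return coverage(panel, reference) >= minimum


panel = ["WT1", "SALL3", "TERT", "ACTB", "CPEB4"]
print(coverage(panel, SET_C), meets_threshold(panel, SET_C))
```

The panel above holds five of the eight markers of subset c), so it passes the 50% and 60% levels but not the 70% level.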

Each of these marker subsets is particularly suitable to diagnose lung cancer or lung cancer type or distinguish between certain cancers, samples or cancer types in a methylation specific assay of these genes.

The inventive primers or probes may be of any nucleic acid, including RNA, DNA, PNA (peptide nucleic acids), LNA (locked nucleic acids). The probes might further comprise methylation specific moieties.

The present invention provides a (master) set of 360 entries (the 359 marker genes and one control, see table 1), and further specific gene locations given by the PCR products of these genes wherein significant methylation can be detected, as well as subsets therefrom with a certain diagnostic value to detect or diagnose lung cancer or distinguish lung cancer type(s). Preferably the set is optimized for a lung cancer or a lung cancer type. Lung cancer types include, without being limited thereto, adenocarcinoma and squamous cell carcinoma. Further indicators differentiating between disease(s), including the diagnosis of any type of lung cancer or lung tumor, or between tumor type(s) are e.g. benign (non (or limited) proliferative) or malignant, metastatic or non-metastatic. The set can also be optimized for a specific sample type in which the methylated DNA is tested. Such samples include blood, urine, saliva, hair, skin, tissues, in particular tissues of the cancer origin mentioned above, in particular lung tissue such as potentially affected or potentially cancerous lung tissue, or serum, sputum, bronchial lavage. The sample may be obtained from a patient to be diagnosed. In preferred embodiments the test sample to be used in the method of identifying a subset is of the same type as a sample to be used in the diagnosis.

In practice, probes specific for potentially aberrant methylated regions are provided, which can then be used for the diagnostic method.

It is also possible to provide primers suitable for a specific amplification, like PCR, of these regions in order to perform a diagnostic test on the methylation state.

Such probes or primers are provided in the context of a set corresponding to the inventive marker genes or marker gene loci as given in table 1.

Such a set of primers or probes may have all 359 inventive markers present and can then be used for a multitude of different cancer detection methods. Of course, not all markers would have to be used to diagnose a lung cancer or lung cancer type. It is also possible to use certain subsets (or combinations thereof) with a limited number of marker probes or primers for diagnosis of certain categories of lung cancer.

Therefore, the present invention provides sets of primers or probes comprising primers or probes for any single marker subset or any combination of marker subsets disclosed herein. In the following, sets of marker genes should be understood to include sets of primer pairs and probes therefor, which can e.g. be provided in a kit.

Set a, WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, TNFRSF10C and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers are in particular suitable to detect lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue.

Set b, WT1, PITX2, SALL3, F2R, DLX2, TERT, HOXA10, MSH4, NHLH2, GNA15, PENK, RASSF1, BOLL, HOXA1, ONECUT2, ABCB1, SPARC, MT1G, HSPA2, SFRP2, PYCARD, GAD1, C5orf4, C5AR1, GDNF, ZDHHC11, SERPINE1, NKX2-1, PITX2, C5AR1, ZNF256, FAM43A, SFRP2, MT3, SERPINE1, CLIC4, TNFRSF10C, GABRA2, MTHFR, ESR2, NEUROG1, PITX2, PLAGL1, TMEFF2, PTTG1, CADM1, S100A8, EFS, JUB, ITGA4, MAGEB2, ERBB2, SRGN, GNAS, TJP2, KCNJ15, SLC25A31, ZNF573, TNFRSF25, APC, KCNQ1, LAMC2, SPHK1, DNAJA4, APC, MBD2, ERCC1, HLA-G, CXADR, TP53, ACTB, KL, SMAD3, HIST1H2AG, CPEB4 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers are also suitable to detect lung cancer and to distinguish between normal lung tissue and lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set c, WT1, DLX2, SALL3, TERT, TNFRSF25, ACTB, SMAD3, CPEB4 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers are suitable to detect lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set d, WT1, DLX2, SALL3, TERT, PITX2, TNFRSF25, KL, ACTB, SMAD3, CPEB4 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers are in particular suitable to detect lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set e, WT1, PITX2, SALL3, DLX2, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DNAJA4, HLA-G, CXADR, TP53, ACTB, CPEB4 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers are also suitable to detect lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set f, WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, CPEB4 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to detect lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set g, WT1, ACTB, DLX2, PITX2, SALL3, HOXA10, TERT, CPEB4, HLA-G, SPARC, RASSF1, DNAJA4, CXADR, TP53, IRAK2, ZNF711 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung carcinoma, in particular using blood samples, e.g. to distinguish blood of healthy persons from tumor samples, including tumor tissue samples or blood from tumor patients. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set h, F2R, ZNF256, CDH13, SERPINB5, KRT14, DLX2, AREG, THRB, HSD17B4, SPARC, HECW2, COL21A1 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish between the grades of differentiation (poor, moderate and well). The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set i, KL, HIST1H2AG, TJP2, SRGN, CDX1, TNFRSF25, APC, HIC1, APC, GNA15, ACTB, WT1, KRT17, AIM1L, DPH1, PITX2, PITX2, KIF5B, BMP2K, GBP2, NHLH2, GDNF, BOLL and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish malignant states (in particular adenocarcinoma and squamous cell carcinoma) together with lung tissue from healthy blood or serum samples. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set j, WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, TNFRSF10C and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish malignant states selected from adenocarcinoma and squamous cell carcinoma from healthy lung tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set k, HOXA10, NEUROD1 and/or either HOXA10 or NEUROD1 can be used to diagnose lung cancer and further to distinguish adenocarcinoma from squamous cell carcinoma.

Set l, WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, CPEB4, DLX2, TNFRSF25, KL, SMAD3 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish cancerous lung tissue from healthy lung tissue.

Set m, TNFRSF25, SALL3, RASSF1, TERT, SPARC, F2R, HOXA10, ZNF711, PITX2 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish cancerous lung tissue from healthy lung tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set n, SALL3, PITX2, SPARC, F2R, TERT, RASSF1, HOXA10, CXADR, KL and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish cancerous lung tissue from healthy lung tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set o, SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, KL and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish cancerous lung tissue from healthy lung tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set p, SALL3, PITX2, SPARC, F2R, HOXA10, DRD2, ACTB, DNAJA4, CXADR, KL and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue.

Set q, SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, TNFRSF25, DNAJA4, TP53, CXADR, KL and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish normal (non-cancerous) lung tissue from lung tumor tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set r, SPARC, SALL3, F2R, PITX2, RASSF1, HOXA10, TERT, KL, TNFRSF25 and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish between adenocarcinoma, healthy lung tissue and squamous cell carcinoma. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set s, SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, KL, TNFRSF25, CXADR and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish adenocarcinoma and squamous cell carcinoma from healthy (benign) lung tissue. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Set t, HOXA10, RASSF1, F2R and sets with at least 50%, preferably at least 60%, at least 70%, at least 80% or at least 90% of these markers can be used to diagnose lung cancer and to distinguish between adenocarcinoma and squamous cell carcinoma. The distinction or diagnosis can be made by using any sample as described above, including serum, sputum, bronchial lavage.

Also provided are combinations of the above mentioned subsets a) to t), in particular sets comprising markers of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or more of these subsets, preferably for the lung cancer type or preferably complete sets a) to t). One preferred set comprises gene markers WT1, SALL3, TERT, ACTB and CPEB4. These markers are common in a set for the diagnosis of lung cancer and suitable to distinguish normal from lung cancer samples. This set preferably is supplemented by the marker genes DLX2, TNFRSF25 or SMAD3. Furthermore, the inventive set may comprise any one of the markers ABCB1, ACTB, AIM1L, APC, AREG, BMP2K, BOLL, C5AR1, C5orf4, CADM1, CDH13, CDX1, CLIC4, COL21A1, CPEB4, CXADR, DLX2, DNAJA4, DPH1, DRD2, EFS, ERBB2, ERCC1, ESR2, F2R, FAM43A, GABRA2, GAD1, GBP2, GDNF, GNA15, GNAS, HECW2, HIC1, HIST1H2AG, HLA-G, HOXA1, HOXA10, HSD17B4, HSPA2, IRAK2, ITGA4, JUB, KCNJ15, KCNQ1, KIF5B, KL, KRT14, KRT17, LAMC2, MAGEB2, MBD2, MSH4, MT1G, MT3, MTHFR, NEUROD1, NHLH2, NKX2-1, ONECUT2, PENK, PITX2, PLAGL1, PTTG1, PYCARD, RASSF1, S100A8, SALL3, SERPINB5, SERPINE1, SERPINI1, SFRP2, SLC25A31, SMAD3, SPARC, SPHK1, SRGN, TERT, THRB, TJP2, TMEFF2, TNFRSF10C, TNFRSF25, TP53, ZDHHC11, ZNF256, ZNF711, F2R, HOXA10, KL, SALL3, SPARC, TNFRSF25, WT1 or any combination thereof, in particular preferred are markers ACTB, APC, CPEB4, CXADR, DLX2, DNAJA4, F2R, HOXA10, KL, PITX2, RASSF1, SALL3, SPARC, TERT, (either TNFRSF10C or TNFRSF25 or both), WT1 or any combination thereof, even more preferred are markers HOXA10, PITX2, RASSF1, SALL3, SPARC, TERT or any combination thereof, in a marker set according to the present invention, in particular as additional markers for any one of sets a) to t), especially the marker set of markers WT1, SALL3, TERT, ACTB and CPEB4.

According to a preferred embodiment of the present invention, the methylation of at least two genes, preferably of at least three genes, especially of at least four genes, is determined. Specifically if the present invention is provided as an array test system, at least ten, especially at least fifteen genes, are preferred. In preferred test set-ups (for example in microarrays (“gene-chips”)) preferably at least 20, even more preferred at least 30, especially at least 40 genes, are provided as test markers. As mentioned above, these markers or the means to test the markers can be provided in a set of probes or a set of primers, preferably both.

In a further embodiment the set comprises up to 100000, up to 90000, up to 80000, up to 70000, up to 60000 or up to 50000 probes or primer pairs (set of two primers for one amplification product), preferably up to 40000, up to 35000, up to 30000, up to 25000, up to 20000, up to 15000, up to 10000, up to 7500, up to 5000, up to 3000, up to 2000, up to 1000, up to 750, up to 500, up to 400, up to 300, or even more preferred up to 200 probes or primers of any kind, particularly in the case of immobilized probes on a solid surface such as a chip.

In certain embodiments the primer pairs and probes are specific for a methylated upstream region of the open reading frame of the marker genes.

Preferably the probes or primers are specific for a methylation in the genetic regions defined by SEQ ID NOs 1081 to 1440, including the adjacent up to 500 base pairs, preferably up to 300, up to 200, up to 100, up to 50 or up to 10 adjacent, corresponding to gene marker IDs 1 to 359 of table 1, respectively. I.e. probes or primers of the inventive set (including the full 359 set, as well as subsets and combinations thereof) are specific for the regions and gene loci identified in table 1, last column with reference to the sequence listing, SEQ ID NOs: 1081 to 1440. As can be seen these SEQ IDs correspond to a certain gene, the latter being a member of the inventive sets, in particular of the subsets a) to t), e.g.

Examples of specific probes or primers are given in table 1 with reference to the sequence listing, SEQ ID NOs 1 to 1080, which form especially preferred embodiments of the invention.
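In the rows of table 1, the four SEQ ID NOs of an entry follow a fixed offset from the gene ID: entry 1 lists 1, 361, 721 and 1081, and entry 61 (WT1) lists 61, 421, 781 and 1141. Assuming this pattern holds for all 360 entries, the numbers can be derived as in the following sketch; the function name and the dictionary keys are illustrative choices and not part of the sequence listing.

```python
def seq_ids(gene_id, n_entries=360):
    """Derive the SEQ ID NOs of a table 1 entry from its gene ID (1..n_entries)."""
    if not 1 <= gene_id <= n_entries:
        raise ValueError("gene ID outside the master set")
    return {
        "hybridisation_probe": gene_id,
        "primer_1": gene_id + n_entries,
        "primer_2": gene_id + 2 * n_entries,
        "pcr_product": gene_id + 3 * n_entries,
    }


print(seq_ids(61))  # entry 61 is WT1 in table 1
```

Under this assumption the probe and primer sequences occupy SEQ ID NOs 1 to 1080 and the PCR products occupy SEQ ID NOs 1081 to 1440, matching the ranges given in the text.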

Preferably the set of the present invention comprises probes or primers for at least one gene or gene product of the list according to table 1, wherein at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, especially preferred 100%, of the total probes or primers are probes or primers for genes of the list according to table 1. Preferably the set, in particular in the case of a set of hybridization probes, is provided immobilized on a solid surface, preferably a chip or in form of a microarray. Since, according to current technology, detection means for genes on a chip allow easier and more robust array design, gene chips using DNA molecules (for detection of methylated DNA in the sample) are a preferred embodiment of the present invention. Such gene chips also allow detection of a large number of nucleic acids.

Preferably the set is provided on a solid surface, in particular a chip, whereon the primers or probes can be immobilized. Solid surfaces or chips may be of any material suitable for the immobilization of biomolecules such as the moieties, including glass, modified glass (aldehyde modified) or metal chips.

The primers or probes can also be provided as such, including lyophilized forms or being in solution, preferably with suitable buffers. The probes and primers can of course be provided in a suitable container, e.g. a tube or micro tube.

The present invention also relates to a method of identifying lung cancer or lung cancer type in a sample comprising DNA from a subject or patient, comprising obtaining a set of nucleic acid primers (or primer pairs) or hybridization probes as defined above (comprising each specific subset or combinations thereof), determining the methylation status of the genes in the sample for which the members of the set are specific and comparing the methylation status of the genes with the status of a confirmed lung cancer or lung cancer type positive and/or negative state, thereby identifying the lung cancer or lung cancer type in the sample. In general the inventive method has been described above and all preferred embodiments of such methods also apply to the method using the set provided herein.

The inventive marker set, including certain disclosed subsets and subsets which can be identified with the methods disclosed herein, is suitable to diagnose lung cancer and distinguish between different lung cancer forms, in particular for diagnostic or prognostic uses. Preferably the markers used (e.g. by utilizing primers or probes of the inventive set) for the inventive diagnostic or prognostic method may be used in smaller numbers than e.g. in the set (or kit) or chip as such, which may be designed for more than one fine tuned diagnosis or prognosis. The markers used for the diagnostic or prognostic method may be up to 100000, up to 90000, up to 80000, up to 70000, up to 60000 or up to 50000, preferably up to 40000, up to 35000, up to 30000, up to 25000, up to 20000, up to 15000, up to 10000, up to 7500, up to 5000, up to 3000, up to 2000, up to 1000, up to 750, up to 500, up to 400, up to 300, up to 200, up to 100, up to 80, or even more preferred up to 60. The inventive set of marker primers or probes can be employed in chip (immobilised) based assays, products or methods, or in PCR based kits or methods. Both PCR and hybridisation (e.g. on a chip) can be used to detect methylated genes.

The inventive marker set, including certain disclosed subsets, which can be identified with the methods disclosed herein, is suitable to distinguish lung cancer from normal tissue, in particular for diagnostic or prognostic uses.

The inventive marker set, including certain disclosed subsets, which can be identified with the methods disclosed herein, is suitable to distinguish adenocarcinoma from squamous cell carcinoma, in particular for diagnostic or prognostic uses.

The present invention is further illustrated by the following examples, without being restricted thereto.

Figures:

FIG. 1: Cross-Validation ROC curve from the Bayesian Compound Covariate Predictor.

EXAMPLES

Example 1 Gene List

TABLE 1 360 master set (with the 359 marker genes and one control) andsequence annotation hybridisation gene Gene alt. Gene probe primer 1(lp) primer 2 (rp) PCR product ID Symbol Symbol (SEQ ID NO:) (SEQ IDNO:) (SEQ ID NO:) (SEQ ID NO:) 1 NHLH2 NHLH2 1 361 721 1081 2 MTHFRMTHFR 2 362 722 1082 3 PRDM2 RIZ1 (PRDM2) 3 363 723 1083 4 MLLT11 MLLT114 364 724 1084 5 S100A9 control_S100A9 5 365 725 1085 6 S100A9 S100A9 6366 726 1086 7 S100A8 S100A8 7 367 727 1087 8 S100A8 control_S100A8 8368 728 1088 9 S100A2 S100A2 9 369 729 1089 10 LMNA LMNA 10 370 730 109011 DUSP23 DUSP23 11 371 731 1091 12 LAMC2 LAMC2 12 372 732 1092 13 PTGS2PTGS2 13 373 733 1093 14 MARK1 MARK1 14 374 734 1094 15 DUSP10 DUSP10 15375 735 1095 16 PARP1 PARP1 16 376 736 1096 17 PSEN2 PSEN2 17 377 7371097 18 CLIC4 CLIC4 18 378 738 1098 19 RUNX3 RUNX3 19 379 739 1099 20AIM1L NM_017977 20 380 740 1100 21 SFN SFN 21 381 741 1101 22 RPA2 RPA222 382 742 1102 23 TP73 TP73 23 383 743 1103 24 TP73 p73 24 384 744 110425 POU3F1 01.10.06 25 385 745 1105 26 MUTYH MUTYH 26 386 746 1106 27UQCRH UQCRH 27 387 747 1107 28 FAF1 FAF1 28 388 748 1108 29 TACSTD2TACSTD2 29 389 749 1109 30 TNFRSF25 TNFRSF25 30 390 750 1110 31 DIRAS3DIRAS3 31 391 751 1111 32 MSH4 MSH4 32 392 752 1112 33 GBP2 Control 33393 753 1113 34 GBP2 GBP2 34 394 754 1114 35 LRRC8C LRRC8C 35 395 7551115 36 F3 F3 36 396 756 1116 37 NANOS1 NM_001009553 37 397 757 1117 38MGMT MGMT 38 398 758 1118 39 EBF3 EBF3 39 399 759 1119 40 DCLRE1CDCLRE1C 40 400 760 1120 41 KIF5B KIF5B 41 401 761 1121 42 ZNF22 ZNF22 42402 762 1122 43 PGBD3 ERCC6 43 403 763 1123 44 SRGN Control 44 404 7641124 45 GATA3 GATA3 45 405 765 1125 46 PTEN PTEN 46 406 766 1126 47MMS19 MMS19L 47 407 767 1127 48 SFRP5 SFRP5 48 408 768 1128 49 PGR PGR49 409 769 1129 50 ATM ATM 50 410 770 1130 51 DRD2 DRD2 51 411 771 113152 CADM1 IGSF4 52 412 772 1132 53 TEAD1 Control 53 413 773 1133 54 OPCMLOPCML 54 414 774 1134 55 CALCA CALCA 55 415 775 1135 56 CTSD CTSD 56 416776 1136 57 MYOD1 MYOD1 57 417 777 1137 58 
IGF2 IGF2 58 418 778 1138 59BDNF BDNF 59 419 779 1139 60 CDKN1C CDKN1C 60 420 780 1140 61 WT1 WT1 61421 781 1141 62 HRAS HRAS1 62 422 782 1142 63 DDB1 DDB1 63 423 783 114364 GSTP1 GSTP1 64 424 784 1144 65 CCND1 CCND1 65 425 785 1145 66 EPS8L2EPS8L2 66 426 786 1146 67 PIWIL4 PIWIL4 67 427 787 1147 68 CHST11 CHST1168 428 788 1148 69 UNG UNG 69 429 789 1149 70 CCDC62 CCDC62 70 430 7901150 71 CDK2AP1 CDK2AP1 71 431 791 1151 72 CHFR CHFR 72 432 792 1152 73GRIN2B GRIN2B 73 433 793 1153 74 CCND2 CCND2 74 434 794 1154 75 VDR VDR75 435 795 1155 76 B4GALNT3 control (wrong 76 436 796 1156 chr of HRAS1)77 NTF3 NTF3 77 437 797 1157 78 CYP27B1 CYP27B1 78 438 798 1158 79 GPR92GPR92 79 439 799 1159 80 ERCC5 ERCC5 80 440 800 1160 81 GJB2 GJB2 81 441801 1161 82 BRCA2 BRCA2 82 442 802 1162 83 KL KL 83 443 803 1163 84CCNA1 CCNA1 84 444 804 1164 85 SMAD9 SMAD9 85 445 805 1165 86 C13orf15RGC32 86 446 806 1166 87 DGKH DGKH 87 447 807 1167 88 DNAJC15 DNAJC15 88448 808 1168 89 RB1 RB1 89 449 809 1169 90 RCBTB2 RCBTB2 90 450 810 117091 PARP2 PARP2 91 451 811 1171 92 APEX1 APEX1 92 452 812 1172 93 JUB JUB93 453 813 1173 94 JUB control_NM_198086 94 454 814 1174 95 EFS EFS 95455 815 1175 96 BAZ1A BAZ1A 96 456 816 1176 97 NKX2-1 TITF1 97 457 8171177 98 ESR2 ESR2 98 458 818 1178 99 HSPA2 HSPA2 99 459 819 1179 100PSEN1 PSEN1 100 460 820 1180 101 PGF PGF 101 461 821 1181 102 MLH3 MLH3102 462 822 1182 103 TSHR TSHR 103 463 823 1183 104 THBS1 THBS1 104 464824 1184 105 MYO5C MYO5C 105 465 825 1185 106 SMAD6 SMAD6 106 466 8261186 107 SMAD3 SMAD3 107 467 827 1187 108 NOX5 SPESP1 108 468 828 1188109 DNAJA4 DNAJA4 109 469 829 1189 110 CRABP1 CRABP1 110 470 830 1190111 BCL2A1 BCL2A1 111 471 831 1191 112 BCL2A1 BCL2A1 112 472 832 1192113 BNC1 BNC1 113 473 833 1193 114 ARRDC4 ARRDC4 114 474 834 1194 115SOCS1 SOCS1 115 475 835 1195 116 ERCC4 ERCC4 116 476 836 1196 117 NTHL1NTHL1 117 477 837 1197 118 PYCARD PYCARD 118 478 838 1198 119 AXIN1AXIN1 119 479 839 1199 120 CYLD NM_015247 120 480 840 1200 121 MT3 
MT3121 481 841 1201 122 MT1A MT1A 122 482 842 1202 123 MT1G MT1G 123 483843 1203 124 CDH1 CDH1 124 484 844 1204 125 CDH13 CDH13 125 485 845 1205126 DPH1 DPH1 126 486 846 1206 127 HIC1 HIC1 127 487 847 1207 128NEUROD2 control_NEUROD2 128 488 848 1208 129 NEUROD2 NEUROD2 129 489 8491209 130 ERBB2 ERBB2 130 490 850 1210 131 KRT19 KRT19 131 491 851 1211132 KRT14 KRT14 132 492 852 1212 133 KRT17 KRT17 133 493 853 1213 134JUP JUP 134 494 854 1214 135 BRCA1 BRCA1 135 495 855 1215 136 COL1A1COL1A1 136 496 856 1216 137 CACNA1G CACNA1G 137 497 857 1217 138 PRKAR1APRKAR1A 138 498 858 1218 139 SPHK1 SPHK1 139 499 859 1219 140 SOX15SOX15 140 500 860 1220 141 TP53 TP53_CGI23_1kb 141 501 861 1221 142 TP53TP53_bothCGIs_1kb 142 502 862 1222 143 TP53 TP53_CGI36_1kb 143 503 8631223 144 TP53 TP53 144 504 864 1224 145 NPTX1 NPTX1 145 505 865 1225 146SMAD2 SMAD2 146 506 866 1226 147 DCC DCC 147 507 867 1227 148 MBD2 MBD2148 508 868 1228 149 ONECUT2 ONECUT2 149 509 869 1229 150 BCL2 BCL2 150510 870 1230 151 SERPINB5 SERPINB5 151 511 871 1231 152 SERPINB2 Control152 512 872 1232 153 SERPINB2 SERPINB2 153 513 873 1233 154 TYMS TYMS154 514 874 1234 155 LAMA1 LAMA1 155 515 875 1235 156 SALL3 SALL3 156516 876 1236 157 LDLR LDLR 157 517 877 1237 158 STK11 STK11 158 518 8781238 159 PRDX2 PRDX2 159 519 879 1239 160 RAD23A RAD23A 160 520 880 1240161 GNA15 GNA15 161 521 881 1241 162 ZNF573 ZNF573 162 522 882 1242 163SPINT2 SPINT2 163 523 883 1243 164 XRCC1 XRCC1 164 524 884 1244 165ERCC2 ERCC2 165 525 885 1245 166 ERCC1 ERCC1 166 526 886 1246 167 C5AR1NM_001736 167 527 887 1247 168 C5AR1 C5AR1 168 528 888 1248 169 POLD1POLD1 169 529 889 1249 170 ZNF350 ZNF350 170 530 890 1250 171 ZNF256ZNF256 171 531 891 1251 172 C3 C3 172 532 892 1252 173 XAB2 XAB2 173 533893 1253 174 ZNF559 ZNF559 174 534 894 1254 175 FHL2 FHL2 175 535 8951255 176 IL1B IL1B 176 536 896 1256 177 IL1B control_IL1B 177 537 8971257 178 PAX8 PAX8 178 538 898 1258 179 DDX18 DDX18 179 539 899 1259 180GAD1 GAD1 180 540 900 1260 181 DLX2 
DLX2 181 541 901 1261 182 ITGA4ITGA4 182 542 902 1262 183 NEUROD1 NEUROD1 183 543 903 1263 184 STAT1STAT1 184 544 904 1264 185 TMEFF2 TMEFF2 185 545 905 1265 186 HECW2HECW2 186 546 906 1266 187 BOLL BOLL 187 547 907 1267 188 CASP8 CASP8188 548 908 1268 189 SERPINE2 SERPINE2 189 549 909 1269 190 NCL NCL 190550 910 1270 191 CYP1B1 CYP1B1 191 551 911 1271 192 TACSTD1 TACSTD1 192552 912 1272 193 MSH2 MSH2 193 553 913 1273 194 MSH6 MSH6 194 554 9141274 195 MXD1 MXD1 195 555 915 1275 196 JAG1 JAG1 196 556 916 1276 197FOXA2 FOXA2 197 557 917 1277 198 THBD THBD 198 558 918 1278 199 CTCFLBORIS 199 559 919 1279 200 CTSZ CTSZ 200 560 920 1280 201 GATA5 GATA5201 561 921 1281 202 CXADR CXADR 202 562 922 1282 203 APP APP 203 563923 1283 204 TTC3 TTC3 204 564 924 1284 205 KCNJ15 Control 205 565 9251285 206 RIPK4 RIPK4 206 566 926 1286 207 TFF1 TFF1 207 567 927 1287 208SEZ6L SEZ6L 208 568 928 1288 209 TIMP3 TIMP3 209 569 929 1289 210 BIKBIK 210 570 930 1290 211 VHL VHL 211 571 931 1291 212 IRAK2 IRAK2 212572 932 1292 213 PPARG PPARG 213 573 933 1293 214 MBD4 MBD4 214 574 9341294 215 RBP1 RBP1 215 575 935 1295 216 XPC XPC 216 576 936 1296 217 ATRATR 217 577 937 1297 218 LXN LXN 218 578 938 1298 219 RARRES1 RARRES1219 579 939 1299 220 SERPINI1 SERPINI1 220 580 940 1300 221 CLDN1 CLDN1221 581 941 1301 222 FAM43A FAM43A 222 582 942 1302 223 IQCG IQCG 223583 943 1303 224 THRB THRB 224 584 944 1304 225 RARB RARB 225 585 9451305 226 TGFBR2 TGFBR2 226 586 946 1306 227 MLH1 MLH1 227 587 947 1307228 DLEC1 DLEC1 228 588 948 1308 229 CTNNB1 CTNNB1 229 589 949 1309 230ZNF502 ZNF502 230 590 950 1310 231 SLC6A20 SLC6A20 231 591 951 1311 232GPX1 GPX1 232 592 952 1312 233 RASSF1 RASSF1A 233 593 953 1313 234 FHITFHIT 234 594 954 1314 235 OGG1 OGG1 235 595 955 1315 236 PITX2 PITX2 236596 956 1316 237 SLC25A31 SLC25A31 237 597 957 1317 238 FBXW7 FBXW7 238598 958 1318 239 SFRP2 SFRP2 239 599 959 1319 240 CHRNA9 CHRNA9 240 600960 1320 241 GABRA2 GABRA2 241 601 961 1321 242 MSX1 MSX1 242 602 9621322 243 
IGFBP7 IGFBP7 243 603 963 1323 244 EREG EREG 244 604 964 1324245 AREG AREG 245 605 965 1325 246 ANXA3 ANXA3 246 606 966 1326 247BMP2K BMP2K 247 607 967 1327 248 APC APC 248 608 968 1328 249 HSD17B4HSD17B4 249 609 969 1329 250 HSD17B4 HSD17B4 250 610 970 1330 251 LOXLOX 251 611 971 1331 252 TERT TERT 252 612 972 1332 253 NEUROG1 NEUROG1253 613 973 1333 254 NR3C1 NR3C1 254 614 974 1334 255 ADRB2 ADRB2 255615 975 1335 256 CDX1 CDX1 256 616 976 1336 257 SPARC SPARC 257 617 9771337 258 C5orf4 Control 258 618 978 1338 259 PTTG1 PTTG1 259 619 9791339 260 DUSP1 DUSP1 260 620 980 1340 261 CPEB4 CPEB4 261 621 981 1341262 SCGB3A1 SCGB3A1 262 622 982 1342 263 GDNF GDNF 263 623 983 1343 264ERCC8 ERCC8 264 624 984 1344 265 F2R F2R 265 625 985 1345 266 F2RL1F2RL1 266 626 986 1346 267 VCAN CSPG2 267 627 987 1347 268 ZDHHC11ZDHHC11 268 628 988 1348 269 RHOBTB3 RHOBTB3 269 629 989 1349 270 PLAGL1PLAGL1 270 630 990 1350 271 SASH1 SASH1 271 631 991 1351 272 ULBP2 ULBP2272 632 992 1352 273 ESR1 ESR1 273 633 993 1353 274 RNASET2 RNASET2 274634 994 1354 275 DLL1 DLL1 275 635 995 1355 276 HIST1H2AG HIST1H2AG 276636 996 1356 277 HLA-G HLA-G 277 637 997 1357 278 MSH5 MSH5 278 638 9981358 279 CDKN1A CDKN1A 279 639 999 1359 280 TDRD6 TDRD6 280 640 10001360 281 COL21A1 COL21A1 281 641 1001 1361 282 DSP DSP 282 642 1002 1362283 SERPINE1 SERPINE1 283 643 1003 1363 284 SERPINE1 SERPINE1 284 6441004 1364 285 FBXL13 FBXL13 285 645 1005 1365 286 NRCAM NRCAM 286 6461006 1366 287 TWIST1 TWIST1 287 647 1007 1367 288 HOXA1 HOXA1 288 6481008 1368 289 HOXA10 HOXA10 289 649 1009 1369 290 SFRP4 SFRP4 290 6501010 1370 291 IGFBP3 IGFBP3 291 651 1011 1371 292 RPA3 RPA3 292 652 10121372 293 ABCB1 ABCB1 293 653 1013 1373 294 TFPI2 TFPI2 294 654 1014 1374295 COL1A2 COL1A2 295 655 1015 1375 296 ARPC1B ARPC1B 296 656 1016 1376297 PILRB PILRB 297 657 1017 1377 298 GATA4 GATA4 298 658 1018 1378 299MAL2 NM_052886 299 659 1019 1379 300 DLC1 DLC1 300 660 1020 1380 301EPPK1 NM_031308 301 661 1021 1381 302 LZTS1 LZTS1 302 
662 1022 1382 303TNFRSF10B TNFRSF10B 303 663 1023 1383 304 TNFRSF10C TNFRSF10C 304 6641024 1384 305 TNFRSF10D TNFRSF10D 305 665 1025 1385 306 TNFRSF10ATNFRSF10A 306 666 1026 1386 307 WRN WRN 307 667 1027 1387 308 SFRP1SFRP1 308 668 1028 1388 309 SNAI2 SNAI2 309 669 1029 1389 310 RDHE2RDHE2 310 670 1030 1390 311 PENK PENK 311 671 1031 1391 312 RDH10 RDH10312 672 1032 1392 313 TGFBR1 TGFBR1 313 673 1033 1393 314 ZNF462 ZNF462314 674 1034 1394 315 KLF4 KLF4 315 675 1035 1395 316 CDKN2A p14_CDKN2A316 676 1036 1396 317 CDKN2B CDKN2B 317 677 1037 1397 318 AQP3 AQP3 318678 1038 1398 319 TPM2 TPM2 319 679 1039 1399 320 TJP2 TJP2 320 680 10401400 321 TJP2 TJP2 321 681 1041 1401 322 PSAT1 PSAT1 322 682 1042 1402323 DAPK1 DAPK1 323 683 1043 1403 324 SYK SYK 324 684 1044 1404 325 XPAXPA 325 685 1045 1405 326 ARMCX2 ARMCX2 326 686 1046 1406 327 RHOXF1OTEX 327 687 1047 1407 328 FHL1 FHL1 328 688 1048 1408 329 MAGEB2 MAGEB2329 689 1049 1409 330 TIMP1 TIMP1 330 690 1050 1410 331 AR AR_humara 331691 1051 1411 332 ZNF711 ZNF6 332 692 1052 1412 333 CD24 CD24 333 6931053 1413 334 ABL1 ABL 334 694 1054 1414 335 ACTB Aktin_VL 335 695 10551415 336 APC APC 336 696 1056 1416 337 CDH1 Ecad1 337 697 1057 1417 338CDH1 Ecad2 338 698 1058 1418 339 FMR1 FX 339 699 1059 1419 340 GNASGNASexAB 340 700 1060 1420 341 H19 H19 341 701 1061 1421 342 HIC1 Igf2342 702 1062 1422 343 IGF2 Igf2 343 703 1063 1423 344 KCNQ1 LIT1 344 7041064 1424 345 GNAS NESP55 345 705 1065 1425 346 CDKN2A P14 346 706 10661426 347 CDKN2B P15 347 707 1067 1427 348 CDKN2A P16_VL 348 708 10681428 349 PITX2 PitxA 349 709 1069 1429 350 PITX2 PitxB 350 710 1070 1430351 PITX2 PitxC 351 711 1071 1431 352 PITX2 PitxD 352 712 1072 1432 353RB1 Rb 353 713 1073 1433 354 SFRP2 SFRP2_VL 354 714 1074 1434 355 SNRPNSNRPN 355 715 1075 1435 356 XIST XIST 356 716 1076 1436 357 IRF4chr6_control 357 717 1077 1437 358 UNC13B chr9_control 358 718 1078 1438359 GSTP1 GSTP1 360 720 1080 1440 360 Lamda lambda_PCR 359 719 1079 1439(control)

Example 2 Samples

Samples from solid tumors were derived from initial surgical resection of primary tumors. Tumor tissue sections were obtained from histopathology, and histopathological as well as clinical data were monitored over the course of clinical management of the patients and/or collected from patient reports in the study center. Anonymised data and DNA were provided.

Example 3 Principle of the Assay and Design

The inventive assay is a multiplexed assay for DNA methylation testing of up to (or even more than) 360 methylation candidate markers, enabling convenient methylation analyses for tumor-marker definition. In its best mode the test is a combined multiplex-PCR and microarray hybridization technique for multiplexed methylation testing. The inventive marker genes, PCR primer sequences, hybridization probe sequences and expected PCR products are given in Table 1, above.

To target hypermethylated DNA regions of the inventive marker genes in several neoplasias, methylation analysis is performed via methylation-sensitive restriction enzyme (MSRE) digestion of 500 ng of starting DNA. A combination of several MSREs ensures complete digestion of unmethylated DNA. All targeted DNA regions have been selected such that sequences containing multiple MSRE sites are flanked by sites for methylation-independent restriction enzymes. This strategy enables pre-amplification of the methylated DNA fraction before methylation analysis. Thus, the design and pre-amplification enable methylation testing on serum, urine, stool etc. when DNA is limiting.

When DNA is tested without pre-amplification, the methylated DNA fraction remaining after digestion of 500 ng is amplified in 16 multiplex PCRs and detected via microarray hybridization. Within these 16 multiplex-PCR reactions 360 different human DNA products can be amplified. Of these, about 20 amplicons serve as digestion and amplification controls and are derived either from known differentially methylated human DNA regions or from several regions without any sites for the MSREs used in this system. The primer set used (every reverse primer is biotinylated) targets 347 different sites located in the 5′UTR of 323 gene regions.

After PCR, amplicons are pooled and positives are detected using streptavidin-Cy3 via microarray hybridization. Although the melting temperature of CpG-rich DNA is very high, primer and probe design as well as hybridization conditions have been optimized; thus this assay enables unequivocal multiplexed methylation testing of human DNA samples. The assay has been designed such that 24 samples can be run in parallel using 384-well PCR plates.
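The plate capacity stated above follows from 24 samples times 16 multiplex PCRs, which fills exactly 384 wells. The following is a minimal sketch of one possible well assignment. The layout (primer mixes 1-16 on rows A-P, samples 1-24 on columns) and the function name `well_for` are assumptions made for illustration and are not specified in the text.

```python
import string

N_SAMPLES = 24     # samples run in parallel (from the text)
N_MULTIPLEX = 16   # multiplex PCRs per sample (from the text)


def well_for(sample: int, primer_mix: int) -> str:
    """Return a 384-well plate position for a 1-based sample and primer mix.

    Assumed layout (not from the source): primer mixes 1-16 occupy
    rows A-P, samples 1-24 occupy columns 1-24.
    """
    if not 1 <= sample <= N_SAMPLES:
        raise ValueError("sample must be in 1..24")
    if not 1 <= primer_mix <= N_MULTIPLEX:
        raise ValueError("primer_mix must be in 1..16")
    row = string.ascii_uppercase[primer_mix - 1]
    return f"{row}{sample}"


# Every (sample, primer mix) combination gets its own well.
all_wells = {well_for(s, m)
             for s in range(1, N_SAMPLES + 1)
             for m in range(1, N_MULTIPLEX + 1)}
```

With this layout each plate column holds the 16 reactions of one sample, so pooling after amplification is done column by column.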

Many DNA samples in several plates can easily be handled in parallel, enabling completion of analyses within 1-2 days.

The entire procedure enables the user to set up a specific PCR test and subsequent gel-based or hybridization-based testing of selected markers using single primer pairs or primer subsets as provided herein or identified by the inventive method from the 360 marker set.

Example 4 MSRE Digestion of DNA

MSRE digestion of DNA (about 500 ng) was performed at 37° C. overnight in a volume of 30 μl in 1× Tango restriction enzyme digestion buffer (MBI Fermentas) using 8 units of each of the MSREs AciI (New England Biolabs), Hin6I and HpaII (both from MBI Fermentas). Digestions were stopped by heat inactivation (10 min, 75° C.) and subjected to PCR amplification.

Example 5 PCR Amplification

An aliquot of 20 μl of MSRE-digested DNA (or, in case of preamplification of methylated DNA (see below), about 500 ng in a volume of 20 μl) was added to 280 μl of PCR premix (without primers). The premix consisted of all reagents at a final concentration of 1× HotStarTaq Buffer (Qiagen), 160 μM dNTPs, 5% DMSO and 0.6 U Hot Firepol Taq (Solis Biodyne) per 20 μl reaction. Alternatively, an equal amount of HotStarTaq (Qiagen) could be used. Eighteen (18) μl of the premix including digested DNA were aliquoted into each of 16 0.2 ml PCR tubes, and 2 μl of one of the primer premixes 1-16 (containing 0.83 pmol/μl of each primer) were added to each tube. PCR reactions were amplified using a thermal cycling profile of 15 min/95° C. and 40 cycles of 40 sec/95° C., 40 sec/65° C. and 1 min 20 sec/72° C., followed by a final elongation of 7 min/72° C.; reactions were then cooled. After amplification the 16 different multiplex-PCR amplicons from each DNA sample were pooled. Successful amplification was checked using 10 μl of the pooled 16 PCR reactions per sample. Positive amplification produced a smear in the range of 100-300 bp on EtBr-stained agarose gels; negative amplification controls must not show a smear in this range.
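The volumes and concentrations in this protocol can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above; the function name `reaction_setup` and the dictionary keys are hypothetical, and the cycling time counts programmed hold times only (instrument ramping is not included).

```python
def reaction_setup():
    """Recompute volumes and primer concentration from the stated protocol."""
    digest_ul = 20.0          # MSRE-digested DNA
    premix_ul = 280.0         # PCR premix without primers
    n_reactions = 16          # one tube per primer premix
    aliquot_ul = 18.0         # premix + DNA per tube
    primer_mix_ul = 2.0       # primer premix per tube
    primer_stock = 0.83       # pmol/ul of each primer in the primer premix

    total_mix_ul = digest_ul + premix_ul           # 300 ul master mix
    dispensed_ul = n_reactions * aliquot_ul        # 288 ul actually used
    reaction_ul = aliquot_ul + primer_mix_ul       # 20 ul per reaction
    primer_pmol = primer_mix_ul * primer_stock     # pmol of each primer per tube
    # 1 pmol/ul equals 1 uM, so pmol per ul of reaction gives uM directly.
    primer_nM = primer_pmol / reaction_ul * 1000.0

    # Hold times only: initial step, 40 cycles, final elongation (seconds).
    cycling_s = 15 * 60 + 40 * (40 + 40 + 80) + 7 * 60

    return {
        "total_mix_ul": total_mix_ul,
        "spare_ul": total_mix_ul - dispensed_ul,
        "reaction_ul": reaction_ul,
        "primer_nM": primer_nM,
        "cycling_s": cycling_s,
    }


setup = reaction_setup()
```

The result shows that the 300 μl master mix leaves 12 μl of pipetting reserve after the 16 aliquots and that each primer ends up at roughly 83 nM in the 20 μl reaction.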

Example 6 Microarray Hybridization and Detection:

Microarrays with the probes of the 360 marker set are blocked for 30 min at room temperature in 3 M urea containing 0.1% SDS, submerged in a stirred Coplin jar. After blocking, slides are washed in 0.1×SSC/0.2% SDS for 5 min, dipped into water and dried by centrifugation.

The PCR amplicon pool of each sample is mixed with an equal volume of 2× hybridization buffer (7×SSC, 0.6% SDS, 50% formamide), denatured for 5 min at 95° C. and held at 70° C. until an aliquot of 100 μl is loaded onto an array covered by a gasket slide (Agilent). Arrays are hybridized at maximum rotation speed in an Agilent hybridization oven for 16 h at 52° C. After removal of the gasket slides, microarray slides are washed at room temperature in wash solution I (1×SSC, 0.2% SDS) for 5 min and in wash solution II (0.1×SSC, 0.2% SDS) for 5 min, given a final wash by dipping the slides 3 times into wash solution III (0.1×SSC), and dried by centrifugation.

For detection of hybridized biotinylated PCR amplicons, streptavidin-Cy3 conjugate (Caltag Laboratories) is diluted 1:400 in PBST-MP (1× PBS, 0.1% Tween 20; 1% skimmed dry milk powder [Sucofin; Germany]), pipetted onto microarrays covered with a coverslip and incubated for 30 min at room temperature in the dark. Coverslips are then washed off the slides using PBST (1× PBS, 0.1% Tween 20), and the slides are washed in fresh PBST for 5 min, rinsed with water and dried by centrifugation.

Example 7 DNA Preamplification for Methylation Profiling (Optional)

In many situations the amount of DNA is limited. Although the inventive methylation test performs well with low amounts of DNA (see above), minimally invasive testing using cell-free DNA from serum, stool, urine and other body fluids is of particular diagnostic relevance.

Samples can be preamplified prior to methylation testing as follows: DNA was digested with restriction enzyme FspI (and/or Csp6I, and/or MseI, and/or Tsp509I; or their isoschizomers), and after (heat) inactivation of the restriction enzyme the fragments were circularized using T4 DNA ligase. Ligation products were digested using a mixture of methylation-sensitive restriction enzymes. Upon enzyme inactivation the entire mixture was amplified using rolling circle amplification (RCA) by phi29 phage polymerase. The RCA amplicons were then directly subjected to the multiplex PCRs of the inventive methylation test without further need for digestion of the DNA prior to amplification.

Alternatively, the preamplified DNA, which is enriched for methylated DNA regions, can be directly subjected to fluorescent labelling, and the labelled products can be hybridized onto the microarrays using the same conditions as described above for hybridization of PCR products. In that case the streptavidin-Cy3 detection step has to be omitted and slides should be scanned directly after stringency washes and drying. Depending on the experimental design for microarray analyses, either single-labelled or dual-labelled hybridizations might be generated. In our experience, the single-label design was used successfully for class comparisons. Although the preamplification protocol enables analyses of minute amounts of DNA, it is also suited for performing genomic methylation screens.

To elucidate methylation biomarkers for prediction of metastasis risk on a genome-wide level, we subjected 500 ng of DNA derived from primary tumor samples to amplification of the methylated DNA using the procedure outlined above. RCA amplicons derived from metastasised and non-metastasised samples were labelled using the CGH Labeling Kit (Enzo, Farmingdale, N.Y.), and labelled products were hybridized onto human 244 k CpG island arrays (Agilent, Waldbronn, Germany). All manipulations were performed according to the instructions of the manufacturers.

Example 8 Data Analysis

Hybridizations performed on a chip with probes for the inventive 360 marker genes were scanned using a GenePix 4000A scanner (Molecular Devices, Ismaning, Germany) with the PMT set to 700V/cm (equal for both wavelengths). Raw image data were extracted using GenePix 6.0 software (Molecular Devices, Ismaning, Germany).

Microarray data analyses were performed using BRB-ArrayTools developed by Dr. Richard Simon and the BRB-ArrayTools Development Team. The software package BRB-ArrayTools (version 3.6; on the web at linus.nci.nih.gov/BRB-ArrayTools.html) was used according to the recommendations of the authors, and the settings used for analyses are delineated in the results where appropriate. For every hybridization, background intensities were subtracted from foreground intensities for each spot. Global normalization was used to median-center the log-ratios on each array in order to adjust for differences in spot/label intensities.
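The two preprocessing steps named above, background subtraction and global median centering of log-ratios, can be illustrated as follows. This is a minimal sketch and not the BRB-ArrayTools implementation: the log base (2), the two-channel input and the handling of non-positive corrected intensities (marked as `None`) are assumptions made for illustration, and all function names are hypothetical.

```python
import math
from statistics import median


def background_correct(foreground, background):
    """Subtract the per-spot background from the per-spot foreground."""
    return [f - b for f, b in zip(foreground, background)]


def median_centered_log_ratios(channel_1, channel_2):
    """Return log2(channel_1 / channel_2) per spot, centered on the array median.

    Spots with a non-positive corrected intensity cannot be log-transformed
    and are returned as None (an assumption of this sketch).
    """
    ratios = [
        math.log2(a / b) if a > 0 and b > 0 else None
        for a, b in zip(channel_1, channel_2)
    ]
    array_median = median(r for r in ratios if r is not None)
    return [None if r is None else r - array_median for r in ratios]


# Toy array with three spots (made-up intensities).
ch1 = background_correct([250.0, 450.0, 850.0], [50.0, 50.0, 50.0])
ch2 = background_correct([120.0, 130.0, 110.0], [20.0, 30.0, 10.0])
centered = median_centered_log_ratios(ch1, ch2)
```

After centering, the median log-ratio of every array is zero, so arrays hybridized or labelled with different overall efficiency become comparable.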

P-values (p) used for feature selection for classification and prediction were based on the univariate significance levels (alpha). P-values (p) and the mis-classification rate during cross-validation (MCR) are given along with the result data.
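Univariate feature selection of this kind can be sketched as below. Note the substitution: instead of a parametric p-value, which would require the t-distribution, the sketch computes an exact permutation p-value for a pooled-variance two-sample t-statistic, which is only feasible for small groups. The function names, the toy data and the alpha used in the usage lines are assumptions for illustration and do not reproduce the software used in the examples.

```python
import math
from itertools import combinations
from statistics import mean


def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    if pooled_var == 0:
        # Both groups are constant: infinite separation or no difference.
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def permutation_p_value(a, b):
    """Exact two-sided permutation p-value over all relabellings."""
    observed = abs(t_statistic(a, b))
    values = list(a) + list(b)
    n = len(values)
    hits = total = 0
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [values[i] for i in idx]
        group_b = [values[i] for i in range(n) if i not in chosen]
        total += 1
        if abs(t_statistic(group_a, group_b)) >= observed - 1e-9:
            hits += 1
    return hits / total


def select_features(data, alpha):
    """Keep markers whose p-value is below the univariate level alpha."""
    return [gene for gene, (a, b) in data.items()
            if permutation_p_value(a, b) < alpha]


# Hypothetical log-intensities for two markers in two classes.
toy = {
    "markerA": ([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1]),
    "markerB": ([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]),
}
selected = select_features(toy, alpha=0.05)
```

With four samples per class the smallest attainable permutation p-value is 2/70, so the toy example uses alpha = 0.05; with 48 samples per class a level such as 0.005 becomes attainable, but the exact enumeration would have to be replaced by random permutations.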

Example 9 Lung Cancer Test

96 DNA samples derived from both normal and lung-tumour tissue of 48 patients and 8 DNA samples isolated from peripheral blood (PB) of healthy individuals were analysed for methylation deviations in the inventive set of 359 genes.

From this analysis, DNA-methylation biomarkers suitable for distinction of tumour and normal lung DNA, as well as DNA-methylation profiles from blood DNA of healthy controls, were deduced. Diagnostic and prognostic marker subsets suitable for diagnostic testing and presymptomatic screening for early detection of lung cancer were determined, in DNA derived from lung tissue but also in DNA extracts from patient samples other than lung, like sputum, serum or plasma.

DNA methylation testing results and data analyses of chip results, as well as qPCR validation of a subset of markers derived from chip-based testing, are provided.

DNA samples analysed were from blood of 8 healthy individuals (PB), 19 tumours (AdenoCa, adenocarcinoma) and 19 normal lung tissues (N) of adenocarcinoma patients, and 29 tumours (SqCCL, squamous cell carcinoma) and 29 normal lung tissues (N) of squamous cell carcinoma patients.

For DNA methylation testing, 600 ng of DNA were digested, and data derived from DNA-microarray hybridizations were analysed using the BRB-ArrayTools statistical software package. Class comparison and class prediction analyses were performed with respect to the sample groups listed above; for delineation of biomarkers for tumour samples, AdenoCa and SqCCL were treated as one tumour sample group (TU).

The design of the test enables methylation testing on DNA directly derived from the biological source. The test is also suitable for use with DNA preamplification upon MSRE digestion (as outlined above). Thus, using methylation-specific preamplification of minute amounts of DNA, biomarker testing is feasible on small samples and limited amounts of DNA, and multiplexed PCR and methylation testing is easily performed on the preamplified DNA obtained from these samples. This strategy would also improve testing of serum, urine, stool, synovial fluid, sputum and other body fluids using the conceptual design of the methylation test.

The possibility of preamplification also enables differential methylation hybridization of the preamplified DNA itself. This option is afforded by the design of the test and the probes. Thus the probes of the methylation test (or the array) can be used for hybridization of labelled DNA after enrichment of either the methylated or the unmethylated DNA fraction of any DNA sample, allowing methylation testing without the multiplex PCR.

In addition, the biomarkers described herein could be applied for methylation testing using alternative approaches, e.g. methylation-sensitive PCR and strategies which are based on sodium-bisulfite deamination of DNA rather than on MSRE digestion of DNA. These sets of methylation markers are suitable markers for disease monitoring, progression, prediction, therapy decision and therapy response.

Example 10 Biomarkers from Microarray-Testing of Patient Samples

Example 10a Class Comparison: TU vs. Normal: p<0.005, Unpaired Samples; 2 Fold Change

This list of methylation markers was found significant (p<0.005) between TU and N using “unpaired” statistical testing of DNA methylation of 48 tumour samples versus 48 healthy lung tissue samples. Significant markers (p<0.005) with at least a 2-fold difference in signal intensity between both classes are listed.

TABLE 2 Sorted by p-value of the univariate test.

No.  Parametric  FDR        Permutation  Geom mean of    Geom mean of    Fold-      Gene
     p-value                p-value      intensities in  intensities in  change     symbol
                                         class 1         class 2
1    <1e−07      <1e−07     <1e−07       1411.8016       13554.578246    0.1041568  WT1
2    <1e−07      <1e−07     <1e−07       85.5069224      1125.7940428    0.0759525  DLX2
3    <1e−07      <1e−07     <1e−07       852.3850013     7392.282404     0.1153074  SALL3
4    <1e−07      <1e−07     <1e−07       235.4745892     592.5077157     0.3974203  TERT
5    <1e−07      <1e−07     <1e−07       274.9097126     833.6648468     0.3297605  PITX2
6    <1e−07      <1e−07     <1e−07       80.5286413      265.3042755     0.3035331  HOXA10
7    <1e−07      <1e−07     <1e−07       112.6645619     855.6410585     0.1316727  F2R
8    1e−07       4.5e−06    <1e−07       2002.2452679    266.6906343     7.507745   CPEB4
9    4e−07       1.46e−05   <1e−07       718.311462      4609.4380991    0.1558349  NHLH2
10   4e−07       1.46e−05   <1e−07       10347.8184959   3603.9811381    2.8712188  SMAD3
11   5e−07       1.65e−05   <1e−07       2993.3054637    1117.4218527    2.6787604  ACTB
12   2.8e−06     8.49e−05   <1e−07       296.6448711     3941.769913     0.0752568  HOXA1
13   3.6e−06     0.0001008  <1e−07       2792.0699393    17199.6551909   0.1623329  BOLL
14   5.9e−06     0.0001342  <1e−07       8664.2840567    2178.4607085    3.9772506  APC
15   1.21e−05    0.0002591  <1e−07       96.7848387      472.6945117     0.2047513  MT1G
16   1.36e−05    0.000275   <1e−07       653.0579403     2188.6201533    0.298388   PENK
17   1.97e−05    0.0003774  <1e−07       1710.9865406    4044.9737351    0.4229908  SPARC
18   3.16e−05    0.0005751  <1e−07       1639.128227     811.4430136     2.0200164  DNAJA4
19   3.85e−05    0.0006673  <1e−07       114.7065029     292.8694482     0.3916643  RASSF1
20   4.28e−05    0.0007081  <1e−07       564.6571983     189.2105463     2.9842797  HLA-G
21   4.98e−05    0.0007881  1e−04        1339.8175413    446.1370253     3.0031525  ERCC1
22   6e−05       0.00091    1e−04        395.6248705     1158.1502714    0.3416006  ONECUT2
23   6.58e−05    0.000958   <1e−07       2517.3232246    1024.0897145    2.4581081  APC
24   8.45e−05    0.0011392  <1e−07       232.2537844     701.7843246     0.3309475  ABCB1
25   0.0002382   0.0029898  1e−04        3027.5067641    1165.5391698    2.5975161  ZNF573
26   0.0003469   0.003946   <1e−07       360.9888133     148.6109072     2.4290869  KCNJ15
27   0.0003582   0.0039511  3e−04        1818.1186026    4147.2970277    0.4383864  ZDHHC11
28   0.0012332   0.01192    0.0013       238.5488592     512.9101159     0.465089   SFRP2
29   0.0019349   0.0176076  0.0015       310.5591882     1215.8855725    0.2554181  GDNF
30   0.002818    0.0227945  0.0022       4930.1368809    2261.9370298    2.1796084  PTTG1
31   0.0038228   0.0267596  0.0045       2402.9850212    974.5347994     2.4657765  SERPINI1
32   0.0039256   0.0269326  0.0031       208.6539745     417.3186041     0.4999872  TNFRSF10C

The 32 genes are significant at the nominal 0.005 level of the univariate test with the fold change 2. Class 1: N; Class 2: T.
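The fold-change column of Table 2 is the ratio of the geometric mean intensity in class 1 (N) to that in class 2 (T), and the 2-fold criterion keeps markers changed in either direction. The following is a minimal sketch of that calculation; the function names are hypothetical, and treating the criterion as "ratio at least 2 or at most 0.5" is an assumption that is consistent with the listed values.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive intensities."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def fold_change(class_1, class_2):
    """Ratio of geometric means, class 1 over class 2."""
    return geometric_mean(class_1) / geometric_mean(class_2)


def passes_fold_filter(fold, min_fold=2.0):
    """True if the change is at least min_fold in either direction."""
    return fold >= min_fold or fold <= 1.0 / min_fold


# Reproducing the WT1 row from the two tabulated geometric means.
wt1_fold = 1411.8016 / 13554.578246
```

A ratio below 1 therefore indicates a higher signal in tumour (class 2) than in normal lung tissue (class 1), as for WT1, whereas a ratio above 1, as for CPEB4, indicates the opposite.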

Example 10b Class Prediction: TU vs Normal: p<0.005, Unpaired Samples; 2 Fold Change

Class prediction using different statistical methods for elucidating marker panels enabling the best classification of TU and N (p<0.005).

Classifier                              Mean percent of correct classification
Compound Covariate Predictor            100
Diagonal Linear Discriminant Analysis   100
1-Nearest Neighbor                      98
3-Nearest Neighbors                     98
Nearest Centroid                        98
Support Vector Machines                 98

TABLE 3 Composition of classifier: Sorted by t-value

No.  Parametric  t-value   % CV     Geometric mean of     Gene symbol
     p-value               support  intensities
                                    (class N/class T)
1    <1e−07      −10.859   100      0.1041568             WT1
2    <1e−07      −7.903    100      0.3297605             PITX2
3    <1e−07      −7.314    100      0.1153074             SALL3
4    <1e−07      −7.063    100      0.1316727             F2R
5    <1e−07      −7.028    100      0.0759525             DLX2
6    <1e−07      −6.592    100      0.3974203             TERT
7    <1e−07      −6.539    100      0.3035331             HOXA10
8    <1e−07      −6.495    100      0.7772068             MSH4
9    <1e−07      −6.357    100      0.1558349             NHLH2
10   4e−07       −5.915    100      0.5405671             GNA15
11   4e−07       −5.908    100      0.298388              PENK
12   4.2e−06     −5.206    100      0.3916643             RASSF1
13   5e−06       −5.155    100      0.1623329             BOLL
14   1.05e−05    −4.935    100      0.0752568             HOXA1
15   3.1e−05     −4.61     100      0.3416006             ONECUT2
16   4.26e−05    −4.514    100      0.3309475             ABCB1
17   4.59e−05    −4.491    100      0.4229908             SPARC
18   4.96e−05    −4.467    100      0.2047513             MT1G
19   8.53e−05    −4.301    100      0.6381881             HSPA2
20   0.0002478   −3.966    100      0.465089              SFRP2
21   0.0002786   −3.929    100      0.7532617             PYCARD
22   0.0003286   −3.876    100      0.6491186             GAD1
23   0.0004296   −3.789    100      0.8137828             C5orf4
24   0.0004695   −3.76     100      0.7676414             C5AR1
25   0.0004699   −3.76     100      0.2554181             GDNF
26   0.0006369   −3.66     100      0.4383864             ZDHHC11
27   0.0008023   −3.584    100      0.8171479             SERPINE1
28   0.0009028   −3.544    100      0.6392075             NKX2-1
29   0.0009179   −3.539    100      0.5993327             PITX2
30   0.0010255   −3.501    100      0.7691876             C5AR1
31   0.0011267   −3.47     100      0.5118859             ZNF256
32   0.0014869   −3.375    100      0.5593175             FAM43A
33   0.0015714   −3.356    100      0.6862518             SFRP2
34   0.0019233   −3.287    100      0.3698669             MT3
35   0.0019731   −3.278    100      0.7715219             SERPINE1
36   0.0019838   −3.276    100      0.8088555             CLIC4
37   0.0023911   −3.21     100      0.4999872             TNFRSF10C
38   0.0027742   −3.158    92       0.8776257             GABRA2
39   0.0028024   −3.154    92       0.7069999             MTHFR
40   0.0030868   −3.12     81       0.6837301             ESR2
41   0.0033263   −3.093    79       0.6327604             NEUROG1
42   0.0036825   −3.057    67       0.6444277             PITX2
43   0.0044243   −2.99     44       0.732542              PLAGL1
44   0.004896    −2.953    40       0.4992372             TMEFF2
45   0.0037996   3.046     65       2.1796084             PTTG1
46   0.0034628   3.079     73       1.1394289             CADM1
47   0.0024932   3.196     100      1.0870547             S100A8
48   0.0024284   3.205     100      1.3497772             EFS
49   0.0020087   3.271     100      1.2801593             JUB
50   0.0017007   3.329     100      1.1823596             ITGA4
51   0.0015061   3.371     100      1.5959594             MAGEB2
52   0.0013429   3.41      100      1.294098              ERBB2
53   0.0011103   3.475     100      1.3485708             SRGN
54   0.0007894   3.589     100      1.3193821             GNAS
55   0.0007437   3.609     100      1.9621539             TJP2
56   0.000457    3.769     100      2.4290869             KCNJ15
57   0.0004291   3.789     100      1.3004513             SLC25A31
58   0.0001587   4.107     100      2.5975161             ZNF573
59   0.0001331   4.163     100      1.4996674             TNFRSF25
60   9.26e−05    4.276     100      2.4581081             APC
61   4.88e−05    4.472     100      1.9612086             KCNQ1
62   3.62e−05    4.564     100      1.4971047             LAMC2
63   1.82e−05    4.77      100      1.5467277             SPHK1
64   1.68e−05    4.794     100      2.0200164             DNAJA4
65   1.45e−05    4.838     100      3.9772506             APC
66   9e−06       4.979     100      1.388284              MBD2
67   8.6e−06     4.994     100      3.0031525             ERCC1
68   4.5e−06     5.182     100      2.9842797             HLA-G
69   4.2e−06     5.202     100      1.7516486             CXADR
70   1.4e−06     5.521     100      1.9112579             TP53
71   1.1e−06     5.605     100      2.6787604             ACTB
72   9e−07       5.647     100      1.9365988             KL
73   6e−07       5.755     100      2.8712188             SMAD3
74   2e−07       6.05      100      1.4368727             HIST1H2AG
75   2e−07       6.115     100      7.507745              CPEB4

Example 10c 4 Greedy Pairs >>92% Correct Using SVM (Support Vector Machine)

Using “4 pairs of methylation markers” derived from greedy pairs class prediction with support vector machines enables 92% correct classification of TU and N.

Performance of Classifiers During Cross-Validation.

Classifier                              Mean percent of correct classification
Compound Covariate Predictor            90
Diagonal Linear Discriminant Analysis   90
1-Nearest Neighbor                      90
3-Nearest Neighbors                     89
Nearest Centroid                        91
Support Vector Machines                 92

Performance of the Support Vector Machine Classifier:

Class  Sensitivity  Specificity  PPV    NPV
N      0.917        0.917        0.917  0.917
T      0.917        0.917        0.917  0.917
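The four performance figures reported for each class follow from the counts of correctly and incorrectly classified samples. The sketch below shows the standard definitions. The counts in the usage line (44 of 48 samples correct in each class) are an assumption that is consistent with the reported value of 0.917 but is not stated in the text; the function name is hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, PPV and NPV for one class versus the rest.

    tp/fn: samples of the class called correctly / incorrectly
    tn/fp: samples of the other class called correctly / incorrectly
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Assumed counts: 48 N and 48 T samples, 4 misclassified in each class.
metrics_n = binary_metrics(tp=44, fn=4, tn=44, fp=4)
```

Because the two classes are of equal size and the assumed errors are symmetric, all four values coincide at 44/48, which rounds to 0.917.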

TABLE 4 Composition of classifier: Sorted by t-value (Sorted by gene pairs)

No.  Parametric  t-value  % CV     Geom mean of    Geom mean of    Fold-      Gene
     p-value              support  intensities in  intensities in  change     symbol
                                   class 1         class 2
1    <1e−07      −9.452   100      1411.8016       13554.578246    0.1041568  WT1
2    <1e−07      −7.222   100      85.5069224      1125.7940428    0.0759525  DLX2
3    <1e−07      −6.648   99       852.3850013     7392.282404     0.1153074  SALL3
4    <1e−07      −6.48    70       235.4745892     592.5077157     0.3974203  TERT
5    0.0017994   3.213    27       437.7037557     291.867223      1.4996674  TNFRSF25
6    5e−07       5.391    100      2993.3054637    1117.4218527    2.6787604  ACTB
7    4e−07       5.474    76       10347.8184959   3603.9811381    2.8712188  SMAD3
8    <1e−07      5.832    98       2002.2452679    266.6906343     7.507745   CPEB4

Class 1: N; Class 2: T.

Example 10d (BRB v3.8) 5 Greedy Pairs

Using “5 pairs of methylation markers” derived from greedy pairs class prediction with support vector machines enables 95% correct classification of TU and N.

Performance of Classifiers During Cross-Validation:

Classifier                              Mean percent of correct classification
Compound Covariate Predictor            92
Diagonal Linear Discriminant Analysis   94
1-Nearest Neighbor                      90
3-Nearest Neighbors                     94
Nearest Centroid                        92
Support Vector Machines                 95
Bayesian Compound Covariate Predictor   95

Note: NA denotes the sample is unclassified. These samples are excluded in the computation of the mean percent of correct classification.

Performance of the Support Vector Machine Classifier:

Class  Sensitivity  Specificity  PPV    NPV
N      0.958        0.938        0.939  0.957
T      0.938        0.958        0.957  0.939

TABLE 5 Composition of classifier: Sorted by t-value (Sorted by gene pairs)

No.  Parametric  t-value  % CV     Geom mean of    Geom mean of    Fold-      Gene
     p-value              support  intensities in  intensities in  change     symbol
                                   class 1         class 2
1    <1e−07      −9.531   100      1378.5556347    13613.2679786   0.1012656  WT1
2    <1e−07      −7.419   100      78.691453       1122.0211285    0.0701337  DLX2
3    <1e−07      −6.702   100      832.1044249     7415.7421008    0.1122078  SALL3
4    <1e−07      −6.625   100      223.339058      595.0731922     0.3753136  TERT
5    <1e−07      −6.586   100      267.2568518     837.2745062     0.3191986  PITX2
6    0.0029082   3.057    35       427.3964613     286.9546694     1.4894215  TNFRSF25
7    1.26e−05    4.612    70       7297.8279144    3875.9637585    1.8828421  KL
8    9e−07       5.255    99       2922.8174216    1122.2601272    2.6044028  ACTB
9    9e−07       5.266    98       10104.1419624   3617.8969167    2.792822   SMAD3
10   2e−07       5.603    100      1911.6531674    265.654275      7.1960188  CPEB4

Class 1: N; Class 2: T.

Example 10e Recursive Feature Elimination Method

Using “16 methylation markers” derived from the Recursive Feature Elimination method for class prediction with Diagonal Linear Discriminant Analysis enables 100% correct classification of TU and N.

Performance of Classifiers During Cross-Validation.

Classifier                              Mean percent of correct classification
Compound Covariate Predictor            98
Diagonal Linear Discriminant Analysis   100
1-Nearest Neighbor                      96
3-Nearest Neighbors                     96
Nearest Centroid                        94
Support Vector Machines                 96

TABLE 6 Composition of classifier: Sorted by t-value

No.  Parametric  t-value   % CV     Geometric mean of     Gene symbol
     p-value               support  intensities
                                    (class N/class T)
1    <1e−07      −10.859   100      0.1041568             WT1
2    <1e−07      −7.903    100      0.3297605             PITX2
3    <1e−07      −7.314    98       0.1153074             SALL3
4    <1e−07      −7.028    81       0.0759525             DLX2
5    <1e−07      −6.592    98       0.3974203             TERT
6    <1e−07      −6.539    98       0.3035331             HOXA10
7    4.2e−06     −5.206    98       0.3916643             RASSF1
8    4.59e−05    −4.491    94       0.4229908             SPARC
9    0.0329896   −2.197    88       0.5237754             IRAK2
10   0.0496307   −2.015    98       0.6640548             ZNF711
11   1.68e−05    4.794     79       2.0200164             DNAJA4
12   4.5e−06     5.182     79       2.9842797             HLA-G
13   4.2e−06     5.202     79       1.7516486             CXADR
14   1.4e−06     5.521     75       1.9112579             TP53
15   1.1e−06     5.605     100      2.6787604             ACTB
16   2e−07       6.115     100      7.507745              CPEB4
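The classification rule of diagonal linear discriminant analysis, applied to a fixed marker list such as the 16 markers above, can be sketched as follows. This is a generic textbook form (class means, pooled per-marker variances, equal class priors) written for illustration; it is not the BRB-ArrayTools code, the recursive feature elimination step is not shown, leave-one-out cross-validation is an assumption, and the data and function names are hypothetical.

```python
from statistics import mean


def fit_dlda(samples, labels):
    """Estimate per-class mean vectors and pooled per-marker variances."""
    classes = sorted(set(labels))
    n_markers = len(samples[0])
    means = {}
    pooled_ss = [0.0] * n_markers
    for c in classes:
        rows = [s for s, l in zip(samples, labels) if l == c]
        means[c] = [mean(r[g] for r in rows) for g in range(n_markers)]
        for r in rows:
            for g in range(n_markers):
                pooled_ss[g] += (r[g] - means[c][g]) ** 2
    dof = len(samples) - len(classes)
    # Small floor avoids division by zero for constant markers.
    variances = [max(ss / dof, 1e-12) for ss in pooled_ss]
    return means, variances


def predict_dlda(model, sample):
    """Assign the class with the smallest variance-scaled squared distance."""
    means, variances = model

    def distance(c):
        return sum((x - m) ** 2 / v
                   for x, m, v in zip(sample, means[c], variances))

    return min(means, key=distance)


def leave_one_out_accuracy(samples, labels):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for i in range(len(samples)):
        model = fit_dlda(samples[:i] + samples[i + 1:],
                         labels[:i] + labels[i + 1:])
        correct += predict_dlda(model, samples[i]) == labels[i]
    return correct / len(samples)


# Hypothetical log-intensities of two markers in three N and three T samples.
X = [[1.0, 5.0], [1.2, 5.1], [0.9, 4.8], [3.0, 2.0], [3.1, 2.2], [2.9, 1.9]]
y = ["N", "N", "N", "T", "T", "T"]
model = fit_dlda(X, y)
```

In a full analysis the marker selection would have to be repeated inside every cross-validation fold; otherwise the estimated accuracy is optimistic.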

Example 10f (BRB v3.8) Recursive Feature Elimination Method

Due to some differences in data importing/normalisation when the data were collated again for statistics (using BRB v. 3.8), a gene list with minor differences (compared to Example 10e) has been calculated from the data, and is given below:

Performance of Classifiers During Cross-Validation.

Classifier                              Mean percent of correct classification
Compound Covariate Predictor            96
Diagonal Linear Discriminant Analysis   100
1-Nearest Neighbor                      96
3-Nearest Neighbors                     96
Nearest Centroid                        96
Support Vector Machines                 96

TABLE 7 Composition of classifier: Sorted by t-value

No. | Parametric p-value | t-value | % CV support | Geometric mean of intensities (class N/class TU) | Gene symbol
1 | <1e−07 | −10.777 | 100 | 0.1012656 | WT1
2 | <1e−07 | −8.046 | 88 | 0.3191986 | PITX2
3 | <1e−07 | −7.336 | 98 | 0.1122078 | SALL3
4 | <1e−07 | −7.232 | 85 | 0.1264427 | F2R
5 | <1e−07 | −6.712 | 100 | 0.3753136 | TERT
6 | <1e−07 | −6.524 | 98 | 0.2930706 | HOXA10
7 | 1.6e−06 | −5.49 | 98 | 0.3695951 | RASSF1
8 | 3.87e−05 | −4.543 | 83 | 0.4112493 | SPARC
9 | 0.0313421 | −2.219 | 88 | 0.5143877 | IRAK2
10 | 0.0366617 | −2.151 | 98 | 0.6452171 | ZNF711
11 | 0.3333009 | 0.978 | 58 | 1.1102014 | DRD2
12 | 4.91e−05 | 4.471 | 77 | 1.9749991 | DNAJA4
13 | 2.25e−05 | 4.707 | 75 | 1.7030259 | CXADR
14 | 7.4e−06 | 5.036 | 88 | 1.8582045 | TP53
15 | 2.1e−06 | 5.402 | 100 | 2.6044028 | ACTB
16 | 5e−07 | 5.815 | 100 | 7.1960188 | CPEB4

Example 10g Recursive Geneset for “PB-N-TU” Distinction Using CLASS Prediction

Distinguishing PB, N, and TU is of interest when minimally invasive testing for lung cancer has to be performed using serum or plasma from peripheral blood. The markers distinguishing PB, N and TU are best suited for this purpose. Using “16 methylation markers” derived from the Recursive Feature Elimination method for class prediction with Diagonal Linear Discriminant Analysis enables 91% correct classification.

Performance of Classifiers During Cross-Validation:

Classifier | Mean percent of correct classification
Diagonal Linear Discriminant Analysis | 91
1-Nearest Neighbor | 89
3-Nearest Neighbors | 87
Nearest Centroid | 88

Performance of the Diagonal Linear Discriminant Analysis Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.875 | 0.946 | 0.933 | 0.898
PB | 1 | 0.948 | 0.615 | 1
T | 0.938 | 0.982 | 0.978 | 0.948

Performance of the 1-Nearest Neighbor Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.979 | 0.821 | 0.825 | 0.979
PB | 0.75 | 0.99 | 0.857 | 0.979
T | 0.833 | 1 | 1 | 0.875

Performance of the 3-Nearest Neighbors Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 1 | 0.75 | 0.774 | 1
PB | 0.125 | 1 | 1 | 0.932
T | 0.854 | 1 | 1 | 0.889

Performance of the Nearest Centroid Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.812 | 0.929 | 0.907 | 0.852
PB | 1 | 0.917 | 0.5 | 1
T | 0.917 | 0.982 | 0.978 | 0.932
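The per-class figures in the performance tables above are one-vs-rest statistics. As a minimal Python sketch of how such figures follow from confusion counts, the snippet below uses a hypothetical helper name and assumed counts; the source reports only the resulting ratios, and the counts were chosen here so that they agree with the class N row of the Diagonal Linear Discriminant Analysis table.

```python
def one_vs_rest_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV and NPV for one class against all others."""
    return {
        "sensitivity": tp / (tp + fn),  # class members predicted as the class
        "specificity": tn / (tn + fp),  # non-members predicted as non-members
        "ppv": tp / (tp + fp),          # predictions of the class that are correct
        "npv": tn / (tn + fn),          # predictions of "not the class" that are correct
    }


# Assumed counts (not stated in the source): 48 N samples and 56 non-N samples.
metrics_n = one_vs_rest_metrics(tp=42, fn=6, fp=3, tn=53)
print({name: round(value, 3) for name, value in metrics_n.items()})
```

The helper does not guard against empty denominators; a class with no predicted members would need separate handling.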

TABLE 8 Composition of classifier: Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in class 1 | Geom mean of intensities in class 2 | Geom mean of intensities in class 3 | Gene symbol
1 | <1e−07 | 65.961 | 100 | 1411.8016 | 335.9542052 | 13554.578246 | WT1
2 | <1e−07 | 34.742 | 100 | 2993.3054637 | 240.5599546 | 1117.4218527 | ACTB
3 | <1e−07 | 30.862 | 100 | 85.5069224 | 70.3843498 | 1125.7940428 | DLX2
4 | <1e−07 | 30.048 | 100 | 274.9097126 | 128.8159291 | 833.6648468 | PITX2
5 | <1e−07 | 28.153 | 100 | 852.3850013 | 349.2428569 | 7392.282404 | SALL3
6 | <1e−07 | 23.333 | 100 | 80.5286413 | 62.0661721 | 265.3042755 | HOXA10
7 | <1e−07 | 21.159 | 100 | 235.4745892 | 296.8149796 | 592.5077157 | TERT
8 | 2e−07 | 17.8 | 100 | 2002.2452679 | 1697.5965438 | 266.6906343 | CPEB4
9 | 4.3e−06 | 13.991 | 100 | 564.6571983 | 1254.1750649 | 189.2105463 | HLA-G
10 | 1.54e−05 | 12.388 | 100 | 1710.9865406 | 1310.5286603 | 4044.9737351 | SPARC
11 | 1.9e−05 | 12.132 | 100 | 114.7065029 | 81.1382549 | 292.8694482 | RASSF1
12 | 6.55e−05 | 10.614 | 100 | 1639.128227 | 1576.0887022 | 811.4430136 | DNAJA4
13 | 0.0008203 | 7.63 | 100 | 1484.6917542 | 1429.9219493 | 847.5968076 | CXADR
14 | 0.0008501 | 7.589 | 100 | 11761.052468 | 9062.1655722 | 6153.5665863 | TP53
15 | 0.041843 | 3.276 | 100 | 105.5844903 | 94.1143599 | 201.5835284 | IRAK2
16 | 0.3946752 | 0.938 | 100 | 483.3048928 | 567.8776158 | 727.8087385 | ZNF711

Class 1: N; Class 2: PB; Class 3: T.

Example 10h Class Prediction “Differentiation”→Poor—Moderate—Well

Distinguishing the grade of differentiation of the tumours can also be achieved by DNA methylation marker testing. Although correct classification is only about 60% in this example, the lung tumour groups “AdenoCa” and “SqCCL” can be split and used separately for determining the grade of tumour differentiation for better performance.

Performance of Classifiers During Cross-Validation.

Classifier | Mean percent of correct classification
Diagonal Linear Discriminant Analysis | 50
1-Nearest Neighbor | 52
3-Nearest Neighbors | 57
Nearest Centroid | 62

TABLE 9 Composition of classifier: Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in class 1 | Geom mean of intensities in class 2 | Geom mean of intensities in class 3 | Gene symbol
1 | 0.0002337 | 10.127 | 100 | 2426.5840626 | 190.6171197 | 840.042225 | F2R
2 | 0.002636 | 6.796 | 100 | 409.0809522 | 178.099004 | 3103.6338503 | ZNF256
3 | 0.0034931 | 6.432 | 100 | 67.1145733 | 81.4305823 | 63.5786575 | CDH13
4 | 0.0044626 | 6.118 | 100 | 30915.9294466 | 15055.465308 | 6829.1471271 | SERPINB5
5 | 0.0082321 | 5.35 | 100 | 289.011498 | 400.2767665 | 163.1721958 | KRT14
6 | 0.0092929 | 5.2 | 100 | 2890.2702155 | 418.2345934 | 211.3575002 | DLX2
7 | 0.0111512 | 4.977 | 100 | 68.3488191 | 83.3593382 | 60.6607364 | AREG
8 | 0.0286999 | 3.846 | 98 | 62.1904027 | 62.94364 | 74.3029102 | THRB
9 | 0.0326517 | 3.696 | 92 | 64.7904336 | 80.1596633 | 60.6607364 | HSD17B4
10 | 0.0414877 | 3.418 | 62 | 5631.0373836 | 2622.6315852 | 3310.1373187 | SPARC
11 | 0.0449927 | 3.325 | 79 | 894.5655128 | 1191.0908574 | 510.2671098 | HECW2
12 | 0.0480858 | 3.249 | 40 | 441.1103703 | 1018.9640546 | 852.4793505 | COL21A1

Class 1: moderate; Class 2: poor; Class 3: well.

Example 10i BinTreePred “Differentiation”→AdenoCa, SqCCL, N, PB

Binary Tree prediction (applicable for the elucidation of markers for more than 2 classes) provides several sets of predictors which enable classification of PB, AdenoCa, SqCCL, and N. These marker sets could be used alternatively for classification.

Optimal Binary Tree:

Cross-validation error rates for a fixed tree structure shown below

Node | Group 1 Classes | Group 2 Classes | Mis-classification rate (%)
1 | AdenoCa, N, SqCCL | PB | 0.0
2 | AdenoCa, SqCCL | N | 9.4
3 | AdenoCa | SqCCL | 31.2
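The fixed tree structure can be read as a sequence of three two-group decisions. The Python sketch below is an illustration only: the function name and the stand-in node classifiers are hypothetical, and each real node classifier would be built from the marker set reported for that node.

```python
def classify_with_tree(sample, node1, node2, node3):
    """Walk the fixed three-node tree; each node returns 1 (group 1) or 2 (group 2)."""
    if node1(sample) == 2:  # node 1: {AdenoCa, N, SqCCL} versus PB
        return "PB"
    if node2(sample) == 2:  # node 2: {AdenoCa, SqCCL} versus N
        return "N"
    # node 3: AdenoCa versus SqCCL
    return "AdenoCa" if node3(sample) == 1 else "SqCCL"


def demo_node(key):
    """Hypothetical stand-in: reads a precomputed group call from the sample."""
    return lambda sample: sample[key]


sample = {"node1": 1, "node2": 1, "node3": 2}
label = classify_with_tree(
    sample, demo_node("node1"), demo_node("node2"), demo_node("node3")
)
print(label)
```

A sample is passed to a later node only if the earlier node assigned it to group 1, so errors made at node 1 or node 2 cannot be corrected further down the tree.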

Results of Classification, Node 1:

TABLE 10 Composition of classifier (23 genes): Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in group 1 | Geom mean of intensities in group 2 | Gene symbol
1 | <1e−07 | 11.494 | 100 | 5370.6044342 | 241.377309 | KL
2 | <1e−07 | 13.624 | 100 | 15595.1182874 | 226.4099812 | HIST1H2AG
3 | <1e−07 | 14.042 | 100 | 15562.4306923 | 62.0661607 | TJP2
4 | <1e−07 | 20.793 | 100 | 36238.4478078 | 169.7749739 | SRGN
5 | <1e−07 | 8.845 | 92 | 2847.6405879 | 176.5970582 | CDX1
6 | <1e−07 | 7.452 | 100 | 357.4232278 | 64.4047416 | TNFRSF25
7 | <1e−07 | 6.909 | 97 | 4344.5133099 | 90.5259025 | APC
8 | <1e−07 | 6.607 | 100 | 38027.3831138 | 10046.5061814 | HIC1
9 | <1e−07 | 6.428 | 100 | 1605.6039019 | 115.3436683 | APC
10 | 2e−07 | 5.611 | 100 | 439.58106 | 107.9138518 | GNA15
11 | 2e−07 | 5.53 | 100 | 1828.8750958 | 240.5597144 | ACTB
12 | 2.47e−05 | 4.42 | 100 | 4374.5147937 | 335.954606 | WT1
13 | 3.53e−05 | −4.327 | 100 | 693.9070151 | 2419.282873 | KRT17
14 | 4.73e−05 | −4.251 | 100 | 3086.6035554 | 8432.6551975 | AIM1L
15 | 5.58e−05 | −4.207 | 100 | 11780.3636838 | 25260.4242674 | DPH1
16 | 0.0001755 | 3.895 | 96 | 2120.616338 | 688.5899191 | PITX2
17 | 0.0005056 | 3.593 | 100 | 478.7300449 | 128.8159563 | PITX2
18 | 0.0012022 | −3.332 | 100 | 167.4354555 | 461.2140013 | KIF5B
19 | 0.0015431 | −3.254 | 100 | 865.090709 | 2041.1567322 | BMP2K
20 | 0.0020491 | −3.164 | 100 | 10857.4258468 | 26743.6730071 | GBP2
21 | 0.0023603 | 3.119 | 100 | 1819.6185255 | 218.3422479 | NHLH2
22 | 0.0040506 | 2.941 | 96 | 614.495327 | 62.0661607 | GDNF
23 | 0.0043281 | 2.918 | 98 | 6929.8366248 | 784.5416613 | BOLL

Results of Classification, Node 2:

TABLE 11 Composition of classifier (32 genes): Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in group 1 | Geom mean of intensities in group 2 | Gene symbol
1 | <1e−07 | 9.452 | 92 | 13554.5792299 | 1411.801824 | WT1
2 | <1e−07 | 7.222 | 92 | 1125.7939487 | 85.5069135 | DLX2
3 | <1e−07 | 6.648 | 69 | 7392.2771156 | 852.3852836 | SALL3
4 | <1e−07 | 6.48 | 92 | 592.5077475 | 235.4746794 | TERT
5 | <1e−07 | 6.445 | 92 | 833.6646395 | 274.909652 | PITX2
6 | <1e−07 | 6.123 | 92 | 265.3043233 | 80.5286481 | HOXA10
7 | <1e−07 | 6.019 | 92 | 855.6411657 | 112.6645794 | F2R
8 | <1e−07 | −5.832 | 92 | 266.6907851 | 2002.2457379 | CPEB4
9 | 4e−07 | 5.482 | 92 | 4609.4395265 | 718.3111003 | NHLH2
10 | 4e−07 | −5.474 | 92 | 3603.9808376 | 10347.8149677 | SMAD3
11 | 5e−07 | −5.391 | 92 | 1117.4212918 | 2993.3062317 | ACTB
12 | 2.8e−06 | 4.984 | 92 | 3941.7717994 | 296.6448908 | HOXA1
13 | 3.6e−06 | 4.922 | 92 | 17199.6559171 | 2792.0695552 | BOLL
14 | 5.9e−06 | −4.802 | 92 | 2178.4609569 | 8664.280092 | APC
15 | 1.21e−05 | 4.622 | 92 | 472.6943985 | 96.784825 | MT1G
16 | 1.36e−05 | 4.593 | 69 | 2188.6204084 | 653.0580827 | PENK
17 | 1.97e−05 | 4.497 | 92 | 4044.9730493 | 1710.9865557 | SPARC
18 | 3.16e−05 | −4.373 | 92 | 811.4434055 | 1639.128128 | DNAJA4
19 | 3.85e−05 | 4.321 | 92 | 292.869462 | 114.7064501 | RASSF1
20 | 4.28e−05 | −4.293 | 92 | 189.210499 | 564.6573579 | HLA-G
21 | 4.98e−05 | −4.253 | 92 | 446.1371701 | 1339.8173509 | ERCC1
22 | 6e−05 | 4.203 | 92 | 1158.1503785 | 395.6249449 | ONECUT2
23 | 6.58e−05 | −4.178 | 92 | 1024.089614 | 2517.3225611 | APC
24 | 8.45e−05 | 4.11 | 92 | 701.7840426 | 232.2538242 | ABCB1
25 | 0.0002382 | −3.821 | 92 | 1165.5392514 | 3027.5052576 | ZNF573
26 | 0.0003469 | −3.713 | 92 | 148.6108699 | 360.9887854 | KCNJ15
27 | 0.0003582 | 3.704 | 92 | 4147.2987214 | 1818.1188972 | ZDHHC11
28 | 0.0012332 | 3.332 | 46 | 512.9098469 | 238.5488699 | SFRP2
29 | 0.0019349 | 3.19 | 92 | 1215.8855046 | 310.5592635 | GDNF
30 | 0.002818 | −3.068 | 92 | 2261.9371454 | 4930.1357863 | PTTG1
31 | 0.0038228 | −2.966 | 92 | 974.5345902 | 2402.9849125 | SERPINI1
32 | 0.0039256 | 2.957 | 90 | 417.3184202 | 208.6541481 | TNFRSF10C

Results of Classification, Node 3:

TABLE 12 Composition of classifier (2 genes): Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in group 1 | Geom mean of intensities in group 2 | Gene symbol
1 | 0.000302 | 3.91 | 40 | 584.5327307 | 158.116767 | HOXA10
2 | 0.0038089 | 3.048 | 46 | 180.3474561 | 67.1158875 | NEUROD1

Example 11 qPCR Validation of Biomarkers

Quantitative PCR with primers for markers elucidated by microarray analysis was run on MSRE-digested DNAs from the same sample groups as analyzed on microarrays. Marker sets for SYBR Green qPCR were taken from Example 10f and Example 10d.

TABLE 13 Markers used for SYBR Green qPCR:

Unique id | Gene symbol
Ahy_61_chr11: 32411664-32412266 +_401-464 | WT1
349_hy_35-PitxA_chr4: 111777754-111778067 | PITX2
Ahy_156_chr18: 74841510-74841935 +_336-389 | SALL3
Ahy_265_chr5: 76046889-76047178 +_134-197 | F2R
Ahy_252_chr5: 1348529-1348893 +_138-187 | TERT
Ahy_289_chr7: 27180142-27180796 +_181-238 | HOXA10
Ahy_233_chr3: 50352877-50353278 +_108-157 | RASSF1
Ahy_257_chr5: 151046476-151047183 +_57-106 | SPARC
Ahy_212_chr3: 10181572-10181986 +_249-298 | IRAK2
Ahy_332_chrX: 84385510-84385717 +_42-106 | ZNF711
Ahy_51_chr11: 112851438-112851650 +_57-107 | DRD2
Ahy_109_chr15: 76343347-76343876 +_373-428 | DNAJA4
Ahy_202_chr21: 17806218-17806561 +_104-167 | CXADR
Ahy_143_chr17: 7532353-7532949 +_415-476 | TP53
335_hy_4-Aktin_VL_chr7: 5538506-5538805 | ACTB
Ahy_261_chr5: 173247753-173248208 +_350-404 | CPEB4
Ahy_181_chr2: 172672873-172673656 +_177-227 | DLX2
Ahy_30_chr1: 6448693-6448938 +_57-107 | TNFRSF25
Ahy_83_chr13: 32489371-32489688 +_181-245 | KL
Ahy_107_chr15: 65146236-65146654 +_305-366 | SMAD3

Negative amplifications (no Cp value generated after 45 cycles of PCR amplification with SYBR Green) were set to Cp=45; all qPCR Cp values were subtracted from 45.01 to obtain transformed data directly comparable to the microarray data. Thus, the higher the transformed value, the more product was generated (corresponding to a lower Cp value). Statistical testing of the transformed data was performed in the same manner as for the microarray data, using BRB-AT software.
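The transformation described above can be sketched in a few lines of Python. The function and constant names are illustrative assumptions, and representing a reaction without amplification as `None` is a convention chosen for this sketch rather than something stated in the source.

```python
NO_AMPLIFICATION_CP = 45.0  # Cp assigned when no Cp value is generated after 45 cycles
OFFSET = 45.01              # constant from which every Cp value is subtracted


def transform_cp(cp):
    """Return 45.01 - Cp; a missing Cp (None) is first set to 45."""
    if cp is None:
        cp = NO_AMPLIFICATION_CP
    return OFFSET - cp


# Hypothetical raw Cp values; None marks a reaction without amplification.
raw_cp_values = [22.5, 31.0, None]
transformed = [transform_cp(cp) for cp in raw_cp_values]
print(transformed)
```

With this offset, a reaction without amplification maps to a small positive value of about 0.01 rather than to zero, and earlier amplification (lower Cp) gives a larger transformed value.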

Class comparison and different strategies/methods for class prediction using the qPCR data enable correct classification of the different sample groups. Although the qPCR conditions were not optimized but run under our standard conditions, the successful classification of groups with markers deduced from microarray analysis confirms the reliability of the methylation markers.

TABLE 14 Nine markers from Table 13 showed significant class differences (fold changes)

Unique id | Gene symbol | Mean of log intensities for N | Mean of log intensities for T | FoldDiff
Ahy_30_chr1: 6448693-6448938 +_57-107 | TNFRSF25 | 7.40354 | 8.5125 | 0.46
Ahy_156_chr18: 74841510-74841935 +_336-389 | SALL3 | 1.59063 | 7.04229 | 0.02
Ahy_233_chr3: 50352877-50353278 +_108-157 | RASSF1 | 5.80167 | 7.95708 | 0.22
Ahy_252_chr5: 1348529-1348893 +_138-187 | TERT | 0.01 | 1.1725 | 0.45
Ahy_257_chr5: 151046476-151047183 +_57-106 | SPARC | 11.76 | 14.10521 | 0.20
Ahy_265_chr5: 76046889-76047178 +_134-197 | F2R | 0.70917 | 4.87917 | 0.06
Ahy_289_chr7: 27180142-27180796 +_181-238 | HOXA10 | 1.67708 | 3.88125 | 0.22
Ahy_332_chrX: 84385510-84385717 +_42-106 | ZNF711 | 4.635 | 6.48875 | 0.28
349_hy_35-PitxA_chr4: 111777754-111778067 | PITX2 | 5.48854 | 8.61813 | 0.11
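The FoldDiff column of Table 14 is consistent with a ratio obtained from the difference of the two mean log intensities, assuming base-2 logarithms; the source does not state the base, so this is an inference from the tabulated values. A minimal Python sketch with an illustrative function name:

```python
def fold_difference(mean_log_n, mean_log_t, base=2.0):
    """Fold difference N/T from mean log intensities (base 2 is an assumption)."""
    return base ** (mean_log_n - mean_log_t)


# Mean log intensities for N and T as listed in Table 14.
table_14 = {
    "TNFRSF25": (7.40354, 8.5125),
    "SALL3": (1.59063, 7.04229),
    "TERT": (0.01, 1.1725),
    "PITX2": (5.48854, 8.61813),
}
for gene, (mean_n, mean_t) in table_14.items():
    print(gene, round(fold_difference(mean_n, mean_t), 2))
```

A fold difference below 1 indicates a lower transformed signal in N than in T for that marker.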

Example 11a CLASS Prediction: TU vs Normal: p<0.01 >> SVM 100%, Paired Samples

Performance of Classifiers During Cross-Validation

Mean Percentage of Correct Classification (n = 48):

Classifier | Mean percent of correct classification
Compound Covariate Predictor | 96
Diagonal Linear Discriminant Analysis | 98
1-Nearest Neighbor | 94
3-Nearest Neighbors | 94
Nearest Centroid | 94
Support Vector Machines | 100

TABLE 15 Composition of classifier: Sorted by t-value

No. | Parametric p-value | t-value | % CV support | Geometric mean of intensities (class N/class T) | Gene symbol
1 | 1e−07 | −6.184 | 100 | 0.0228499 | SALL3
2 | 2e−07 | −6.162 | 100 | 0.1142619 | PITX2
3 | 4e−07 | −5.879 | 100 | 0.1967986 | SPARC
4 | 3.5e−06 | −5.254 | 100 | 0.0555527 | F2R
5 | 8.08e−05 | −4.318 | 100 | 0.4467377 | TERT
6 | 0.0009183 | −3.538 | 100 | 0.2244683 | RASSF1
7 | 0.0011335 | −3.468 | 100 | 0.21701 | HOXA10
8 | 0.0045818 | 2.978 | 100 | 1.7787126 | CXADR
9 | 0.0012761 | 3.427 | 100 | 3.3134481 | KL

Example 11b CLASS Prediction: TU vs Normal: p<0.01

Performance of the Support Vector Machine Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.917 | 0.875 | 0.88 | 0.913
T | 0.875 | 0.917 | 0.913 | 0.88

Performance of the Bayesian Compound Covariate Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.792 | 0.604 | 0.667 | 0.744
T | 0.604 | 0.792 | 0.744 | 0.667

TABLE 16 Composition of classifier: Sorted by t-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in class 1 | Geom mean of intensities in class 2 | Fold-change | Unique id | Gene symbol
1 | <1e−07 | −6.713 | 100 | 3.011798 | 131.8077746 | 0.0228499 | Ahy_156_chr18:74841510-74841935 +_336-389 | SALL3
2 | <1e−07 | −6.491 | 100 | 3468.2688243 | 17623.4446406 | 0.1967986 | Ahy_257_chr5:151046476-151047183 +_57-106 | SPARC
3 | <1e−07 | −6.208 | 100 | 44.8968301 | 392.9290497 | 0.1142619 | 349_hy_35-PitxA_chr4:111777754-111778067 | PITX2
4 | 1e−06 | −5.248 | 100 | 1.6348595 | 29.429 | 0.0555527 | Ahy_265_chr5:76046889-76047178 +_134-197 | F2R
5 | 3.91e−05 | −4.318 | 100 | 1.0069555 | 2.2540195 | 0.4467377 | Ahy_252_chr5:1348529-1348893 +_138-187 | TERT
6 | 0.0003748 | −3.691 | 100 | 55.7796365 | 248.4967761 | 0.2244683 | Ahy_233_chr3:50352877-50353278 +_108-157 | RASSF1
7 | 0.0009309 | −3.419 | 100 | 3.1978081 | 14.7357642 | 0.21701 | Ahy_289_chr7:27180142-27180796 +_181-238 | HOXA10
8 | 0.0009772 | 3.404 | 100 | 3114.5146028 | 939.9618007 | 3.3134481 | Ahy_83_chr13:32489371-32489688 +_181-245 | KL

Class 1: N; Class 2: T.

TABLE 16b Prediction rule from the linear predictors

The prediction rule is defined by the inner sum of the weights (wi) and expression (xi) of significant genes. The expression is the log ratios for dual-channel data and log intensities for single-channel data. A sample is classified to the class N if the sum is greater than the threshold; that is, Σi wi xi > threshold.

The threshold for the Compound Covariate predictor is −172.255.
The threshold for the Diagonal Linear Discriminant predictor is −15.376.
The threshold for the Support Vector Machine predictor is 0.838.

Table. Gene Weights

No. | Genes | Compound Covariate Predictor | Diagonal Linear Discriminant Analysis | Support Vector Machines
1 | Ahy_83_chr13: 32489371-32489688 +_181-245 | 3.4041 | 0.2794 | 1.2796
2 | Ahy_156_chr18: 74841510-74841935 +_336-389 | −6.7126 | −0.3444 | −0.2136
3 | Ahy_233_chr3: 50352877-50353278 +_108-157 | −3.6907 | −0.2633 | 0.0512
4 | Ahy_252_chr5: 1348529-1348893 +_138-187 | −4.3175 | −0.6681 | −1.1674
5 | Ahy_257_chr5: 151046476-151047183 +_57-106 | −6.4911 | −0.7486 | −0.7093
6 | Ahy_265_chr5: 76046889-76047178 +_134-197 | −5.2477 | −0.2752 | −0.0135
7 | Ahy_289_chr7: 27180142-27180796 +_181-238 | −3.419 | −0.221 | −0.3187
8 | 349_hy_35-PitxA_chr4: 111777754-111778067 | −6.2083 | −0.5132 | −0.353
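The prediction rule of Table 16b amounts to a weighted sum compared against a threshold. The Python sketch below applies the Compound Covariate weights and threshold from the table; the function name, the mapping of unique ids to gene symbols (taken from Table 13), the use of base-2 logarithms, and the two example inputs (built from the class geometric means of Table 16, not from individual samples) are assumptions made for illustration.

```python
import math

# Compound Covariate weights and threshold as listed in Table 16b,
# keyed by gene symbol via the unique ids of Table 13.
CCP_WEIGHTS = {
    "KL": 3.4041, "SALL3": -6.7126, "RASSF1": -3.6907, "TERT": -4.3175,
    "SPARC": -6.4911, "F2R": -5.2477, "HOXA10": -3.419, "PITX2": -6.2083,
}
CCP_THRESHOLD = -172.255


def classify(log_intensities, weights=CCP_WEIGHTS, threshold=CCP_THRESHOLD):
    """Class N if the weighted sum exceeds the threshold, otherwise class T."""
    score = sum(w * log_intensities[gene] for gene, w in weights.items())
    return "N" if score > threshold else "T"


# Illustrative inputs only: class geometric means from Table 16 (N, T),
# converted to log2 intensities (the base is an assumption).
geom_means = {
    "SALL3": (3.011798, 131.8077746), "SPARC": (3468.2688243, 17623.4446406),
    "PITX2": (44.8968301, 392.9290497), "F2R": (1.6348595, 29.429),
    "TERT": (1.0069555, 2.2540195), "RASSF1": (55.7796365, 248.4967761),
    "HOXA10": (3.1978081, 14.7357642), "KL": (3114.5146028, 939.9618007),
}
n_like = {gene: math.log2(n) for gene, (n, t) in geom_means.items()}
t_like = {gene: math.log2(t) for gene, (n, t) in geom_means.items()}
print(classify(n_like), classify(t_like))
```

The same function can be used with the Diagonal Linear Discriminant or Support Vector Machine columns by passing the corresponding weights and threshold from the table.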

Example 11c Recursive Feature Extraction (n=10) Prediction: TU vs Normal→98% Correct, Paired Samples

TABLE 17 Composition of classifier: Sorted by t-value

No. | Parametric p-value | t-value | % CV support | Geometric mean of intensities (class N/class T) | Gene symbol
1 | 1e−07 | −6.184 | 100 | 0.0228499 | SALL3
2 | 2e−07 | −6.162 | 100 | 0.1142619 | PITX2
3 | 4e−07 | −5.879 | 100 | 0.1967986 | SPARC
4 | 3.5e−06 | −5.254 | 100 | 0.0555527 | F2R
5 | 0.0011335 | −3.468 | 100 | 0.21701 | HOXA10
6 | 0.0188086 | −2.434 | 92 | 0.5671786 | DRD2
7 | 0.3539709 | 0.936 | 94 | 1.2886257 | ACTB
8 | 0.1083921 | 1.637 | 100 | 1.8305684 | DNAJA4
9 | 0.0045818 | 2.978 | 98 | 1.7787126 | CXADR
10 | 0.0012761 | 3.427 | 100 | 3.3134481 | KL

Example 11d Greedy Pairs (6) Prediction: TU vs Normal: 88% SVM, UNpaired Samples

Performance of the Support Vector Machine Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.896 | 0.854 | 0.86 | 0.891
T | 0.854 | 0.896 | 0.891 | 0.86

Performance of the Bayesian Compound Covariate Classifier:

Class | Sensitivity | Specificity | PPV | NPV
N | 0.812 | 0.604 | 0.672 | 0.763
T | 0.604 | 0.812 | 0.763 | 0.672

TABLE 18 Composition of classifier: Sorted by t-value (Sorted by gene pairs)

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in class 1 | Geom mean of intensities in class 2 | Fold-change | Gene symbol
1 | <1e−07 | −6.713 | 100 | 3.011798 | 131.8077746 | 0.0228499 | SALL3
2 | <1e−07 | −6.491 | 100 | 3468.2688243 | 17623.4446406 | 0.1967986 | SPARC
3 | <1e−07 | −6.208 | 100 | 44.8968301 | 392.9290497 | 0.1142619 | PITX2
4 | 1e−06 | −5.248 | 100 | 1.6348595 | 29.429 | 0.0555527 | F2R
5 | 3.91e−05 | −4.318 | 100 | 1.0069555 | 2.2540195 | 0.4467377 | TERT
6 | 0.0003748 | −3.691 | 100 | 55.7796365 | 248.4967761 | 0.2244683 | RASSF1
7 | 0.0009309 | −3.419 | 100 | 3.1978081 | 14.7357642 | 0.21701 | HOXA10
8 | 0.0137274 | −2.512 | 100 | 169.3121483 | 365.1891236 | 0.4636287 | TNFRSF25
9 | 0.1465343 | 1.464 | 98 | 4255.1669082 | 2324.5057894 | 1.8305684 | DNAJA4
10 | 0.1463194 | 1.465 | 50 | 326.8534389 | 203.1873409 | 1.6086309 | TP53
11 | 0.0176345 | 2.416 | 100 | 2588.5288498 | 1455.2822633 | 1.7787126 | CXADR
12 | 0.0009772 | 3.404 | 100 | 3114.5146028 | 939.9618007 | 3.3134481 | KL

Class 1: N; Class 2: T.

Cross-Validation ROC curve from the Bayesian Compound Covariate Predictor. The area under the curve is 0.944 (FIG. 1).

Example 11e CLASS Prediction: Histology: p<0.05

Using all qPCRs for Class Prediction Analysis of Tumor-Subtype Versus Normal Lung Tissue

TABLE 19 Composition of classifier: Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in class 1 | Geom mean of intensities in class 2 | Geom mean of intensities in class 3 | Gene symbol
1 | <1e−07 | 23.305 | 100 | 11832.9848147 | 3468.2688243 | 22878.8045137 | SPARC
2 | <1e−07 | 22.546 | 100 | 98.6115161 | 3.011798 | 159.4048479 | SALL3
3 | 1e−07 | 19.146 | 100 | 7.6044403 | 1.6348595 | 71.4209691 | F2R
4 | 1e−07 | 19.124 | 100 | 359.9316118 | 44.8968301 | 416.1715345 | PITX2
5 | 2.81e−05 | 11.753 | 100 | 90.8736104 | 55.7796365 | 480.3462809 | RASSF1
6 | 3.15e−05 | 11.611 | 100 | 48.8581148 | 3.1978081 | 6.7191365 | HOXA10
7 | 0.0001543 | 9.66 | 100 | 1.9602703 | 1.0069555 | 2.4699516 | TERT
8 | 0.0042218 | 5.802 | 100 | 1047.8074626 | 3114.5146028 | 875.3966524 | KL
9 | 0.0233243 | 3.914 | 100 | 263.7738716 | 169.3121483 | 451.9439364 | TNFRSF25

Class 1: AdenoCa; Class 2: N; Class 3: SqCCL.

Performance of Classifiers During Cross-Validation

Mean percent of correct classification, n = 96:

Classifier | Mean percent of correct classification
Diagonal Linear Discriminant Analysis | 72
1-Nearest Neighbor | 74
3-Nearest Neighbors | 74
Nearest Centroid | 72

Example 11f: Bintree Prediction: Histology, p<0.05, UNpaired Samples, “Compound Covariate Classifier”

Optimal Binary Tree:

Cross-validation error rates for a fixed tree structure shown below

Node | Group 1 Classes | Group 2 Classes | Mis-classification rate (%)
1 | AdenoCa, SqCCL | N | 14.6
2 | AdenoCa | SqCCL | 31.2

Results of Classification, Node 1:

TABLE 20 Composition of classifier (10 genes): Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in group 1 | Geom mean of intensities in group 2 | Gene symbol
1 | <1e−07 | 6.713 | 100 | 131.8077753 | 3.011798 | SALL3
2 | <1e−07 | 6.491 | 100 | 17623.4448347 | 3468.2687994 | SPARC
3 | <1e−07 | 6.208 | 100 | 392.9290438 | 44.8968296 | PITX2
4 | 1e−06 | 5.248 | 100 | 29.4290011 | 1.6348595 | F2R
5 | 3.91e−05 | 4.317 | 100 | 2.2540195 | 1.0069556 | TERT
6 | 0.0003748 | 3.691 | 100 | 248.4967776 | 55.779638 | RASSF1
7 | 0.0009309 | 3.419 | 100 | 14.7357644 | 3.197808 | HOXA10
8 | 0.0009772 | −3.404 | 100 | 939.9618108 | 3114.5147006 | KL
9 | 0.0137274 | 2.511 | 100 | 365.1891266 | 169.3121466 | TNFRSF25
10 | 0.0176345 | −2.416 | 100 | 1455.2823102 | 2588.528822 | CXADR

Results of Classification, Node 2:

TABLE 21 Composition of classifier (3 genes): Sorted by p-value

No. | Parametric p-value | t-value | % CV support | Geom mean of intensities in group 1 | Geom mean of intensities in group 2 | Gene symbol
1 | 0.0058346 | 2.892 | 50 | 48.8581156 | 6.7191366 | HOXA10
2 | 0.0253305 | −2.312 | 50 | 90.8736092 | 480.3462899 | RASSF1
3 | 0.0330755 | −2.197 | 49 | 7.6044405 | 71.4209719 | F2R

1.-15. (canceled)
16. A nucleic acid primer or hybridization probe set specific for at least one potentially methylated region of at least one marker gene suitable to diagnose or predict lung cancer or a lung cancer type.
17. The set of claim 16, wherein the at least one marker gene is further defined as WT1, SALL3, TERT, ACTB, or CPEB4.
18. The set of claim 16, wherein the lung cancer is adenocarcinoma or squamous cell carcinoma.
19. The set of claim 16, further comprising a nucleic acid primer or hybridization probe specific for at least one additional marker gene defined as ABCB1, ACTB, AIM1L, APC, AREG, BMP2K, BOLL, C5AR1, C5orf4, CADM1, CDH13, CDX1, CLIC4, COL21A1, CPEB4, CXADR, DLX2, DNAJA4, DPH1, DRD2, EFS, ERBB2, ERCC1, ESR2, F2R, FAM43A, GABRA2, GAD1, GBP2, GDNF, GNA15, GNAS, HECW2, HIC1, HIST1H2AG, HLA-G, HOXA1, HOXA10, HSD17B4, HSPA2, IRAK2, ITGA4, JUB, KCNJ15, KCNQ1, KIF5B, KL, KRT14, KRT17, LAMC2, MAGEB2, MBD2, MSH4, MT1G, MT3, MTHFR, NEUROD1, NHLH2, NKX2-1, ONECUT2, PENK, PITX2, PLAGL1, PTTG1, PYCARD, RASSF1, S100A8, SALL3, SERPINB5, SERPINE1, SERPINI1, SFRP2, SLC25A31, SMAD3, SPARC, SPHK1, SRGN, TERT, THRB, TJP2, TMEFF2, TNFRSF10C, TNFRSF25, TP53, ZDHHC11, ZNF256, ZNF711, F2R, HOXA10, KL, SALL3, SPARC, TNFRSF25, or WT1.
20. The set of claim 16, further defined as a nucleic acid primer or hybridization probe set comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of at least 50% of the marker genes in at least one of the following combinations: WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, and TNFRSF10C; WT1, PITX2, SALL3, F2R, DLX2, TERT, HOXA10, MSH4, NHLH2, GNA15, PENK, RASSF1, BOLL, HOXA1, ONECUT2, ABCB1, SPARC, MT1G, HSPA2, SFRP2, PYCARD, GAD1, C5orf4, C5AR1, GNDF, ZDHHC11, SERPINE1, NKX2-1, PITX2, C5AR1, GDNF, ZDHHC11, SERPINE1, NKX2-1, PITX2, C5AR1, ZNF256, FAM43A, SFRP2, MT3, SERPINE1M, CLIC4, TNFRSF10C, GABRA2, MTHFR, ESR2, NEUROG1, PITX2, PLAGL1, TMEFF2, PTTG1, CADM1, S100A8, EFS, JUB, ITGA4, MAGEB2, ERBB2, SRGN, GNAS, TJP2, KCNJ15, SLC25A31, ZNF573, TNFRSF25, APC, KCNQ1, LAMC2, SPHK1, DNAJA4, APC, MBD2, ERCC1, HLA-G, CXADR, TP53, ACTB, KL, SMAD3, HIST1H2AG, and CPEB4; WT1, DLX2, SALL3, TERT, TNFRSF25, ACTB, SMAD3, and CPEB4; WT1, DLX2, SALL3, TERT, PITX2, TNFRSF25, KL, ACTB, SMAD3, and CPEB4; WT1, PITX2, SALL3, DLX2, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DNAJA4, HLA-G, CXADR, TP53, ACTB, and CPEB4; WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, and CPEB4; WT1, ACTB, DLX2, PITX2, SALL3, HOXA10, TERT, CPEB4, HLA-G, SPARC, RASSF1, DNAJA4, CXADR, TP53, IRAK2, and ZNF711; F2R, ZNF256, CDH13, SERPINB5, KRT14, DLX2, AREG, THRB, HSD17B4, SPARC, HECW2, and COL21A1; KL, HIST1H2AG, TJP2, SRGN, CDX1, TNFRSF25, APC, HIC1, APC, GNA15, ACTB, WT1, KRT17, AIM1L, DPH1, PITX2, PITX2, KIF5B, BMP2K, GBP2, NHLH2, GDNF, and BOLL; WT1, DLX2, SALL3, TERT, PITX2, HOXA10, F2R, CPEB4, NHLH2, SMAD3, ACTB, HOXA1, BOLL, APC, MT1G, PENK, SPARC, DNAJA4, RASSF1, HLA-G, ERCC1, ONECUT2, APC, ABCB1, ZNF573, KCNJ15, ZDHHC11, SFRP2, GDNF, PTTG1, SERPINI1, and TNFRSF10C; HOXA10 and NEUROD1; WT1, PITX2, SALL3, F2R, TERT, HOXA10, RASSF1, SPARC, IRAK2, ZNF711, DRD2, DNAJA4, CXADR, TP53, ACTB, CPEB4, DLX2, TNFRSF25, KL, and SMAD3; TNFRSF25, SALL3, RASSF1, TERT, SPARC, F2R, HOXA10, ZNF711, and PITX2; SALL3, PITX2, SPARC, F2R, TERT, RASSF1, HOXA10, CXADR, and KL; SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, and KL; SALL3, PITX2, SPARC, F2R, HOXA10, DRD2, ACTB, DNAJA4, CXADR, and KL; SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, TNFRSF25, DNAJA4, TP53, CXADR, and KL; SPARC, SALL3, F2R, PITX2, RASSF1, HOXA10, TERT, KL, and TNFRSF25; SALL3, SPARC, PITX2, F2R, TERT, RASSF1, HOXA10, KL, TNFRSF25, and CXADR; and HOXA10, RASSF1, and F2R.
21. The set of claim 20, further defined as comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of at least 60% of the marker genes in at least one of the combinations.
22. The set of claim 21, further defined as comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of at least 70% of the marker genes in at least one of the combinations.
23. The set of claim 22, further defined as comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of at least 80% of the marker genes in at least one of the combinations.
24. The set of claim 23, further defined as comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of at least 90% of the marker genes in at least one of the combinations.
25. The set of claim 24, further defined as comprising nucleic acid primers or hybridization probes being specific for potentially methylated regions of 100% of the marker genes in at least one of the combinations.
26. The set of claim 16, further defined as comprising not more than 100000 probes or primer pairs.
27. The set of claim 26, further defined as comprising immobilized probes on a solid surface.
28. The set of claim 26, wherein the primer pairs and probes are specific for a methylated upstream region of an open reading frame of the marker genes.
29. The set of claim 26, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 500 base pairs corresponding to any of gene marker IDs 1 to 359.
30. The set of claim 29, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 300 base pairs corresponding to any of gene marker IDs 1 to 359.
31. The set of claim 29, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 200 base pairs corresponding to any of gene marker IDs 1 to 359.
32. The set of claim 29, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 100 base pairs corresponding to any of gene marker IDs 1 to 359.
33. The set of claim 29, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 50 base pairs corresponding to any of gene marker IDs 1 to 359.
34. The set of claim 29, wherein the probes or primers are specific for methylation in the genetic regions defined by any of SEQ ID NOs 1081 to 1440 including the adjacent up to 10 base pairs corresponding to any of gene marker IDs 1 to 359.
35. The set of claim 29, wherein the probes or primers are of SEQ ID NOs 1 to 1080.
36. A method of identifying or predicting a lung cancer or a lung cancer type in a patient, comprising: obtaining a sample comprising DNA from a patient; obtaining a set of nucleic acid primers or hybridization probes of claim 16; using the set to determine methylation status of genes in the sample for which the members of the set are specific; and comparing the methylation status of the genes with the status of a confirmed lung cancer type positive and/or negative state, thereby identifying lung cancer or lung cancer type, if any, in the patient.
37. The method of claim 36, wherein the methylation status is determined by methylation specific PCR analysis, methylation specific digestion analysis and either or both of hybridization analysis to non-digested or digested fragments or PCR amplification analysis of non-digested or digested fragments.
38. A method of determining a subset of diagnostic markers for potentially methylated genes from the genes of gene marker IDs 1-359 of Table 1, suitable for the diagnosis or prognosis of lung cancer or lung cancer type, comprising: a) obtaining data of the methylation status of at least 50 random genes selected from the 359 genes of gene marker IDs 1-359 in at least 1 sample of a confirmed lung cancer or lung cancer type state and at least one sample of a lung cancer or lung cancer type negative state; b) correlating the results of the obtained methylation status with the lung cancer or lung cancer type; c) optionally repeating the obtaining a) and correlating b) steps for a different combination of at least 50 random genes selected from the 359 genes of gene marker IDs 1-359; and d) selecting as many marker genes which in a classification analysis have a p-value of less than 0.1 in a random-variance t-test, or selecting as many marker genes which in a classification analysis together have a correct lung cancer or lung cancer type prediction of at least 70% in a cross-validation test; wherein the selected markers form the subset of diagnostic markers.
39. The method of claim 38, wherein a) is further defined as comprising obtaining data of the methylation status of at least 50 random genes selected from the 359 genes of gene marker IDs 1-359 in at least 5 samples of a confirmed lung cancer or lung cancer type state.
40. The method of claim 38, wherein the correlated results for each gene b) are rated by their correct correlation to the disease or tumor type positive state, preferably by p-value test, and selected in step d) in order of the rating.
41. The method of claim 38, wherein the at least 50 genes of step a) are at least 100 genes.
42. The method of claim 41, wherein the at least 100 genes of step a) are at least 250 genes.
43. The method of claim 42, wherein the at least 250 genes of step a) are all of the genes.
44. The method of claim 38, wherein not more than 40 marker genes are selected in step d) for the subset.
45. The method of claim 38, wherein the step a) of obtaining data of the methylation status comprises determining data of the methylation status by methylation specific PCR analysis, methylation specific digestion analysis, or hybridization analysis to non-digested or digested fragments, or PCR amplification analysis of non-digested or digested fragments.
46. A method of identifying or predicting a lung cancer or a lung cancer type in a patient, comprising: obtaining a sample comprising DNA from a patient; providing a set of a diagnostic subset of markers identified by a method of claim 38; using the set to determine methylation status of genes in the sample for which the members of the set are specific; and comparing the methylation status of the genes with the status of a confirmed lung cancer type positive and/or negative state, thereby identifying lung cancer or lung cancer type, if any, in the patient.
47. The method of claim 46, wherein the methylation status is determined by methylation specific PCR analysis, methylation specific digestion analysis and either or both of hybridization analysis to non-digested or digested fragments or PCR amplification analysis of non-digested or digested fragments.